Posted to dev@kafka.apache.org by Divij Vaidya <di...@gmail.com> on 2023/02/14 15:15:55 UTC

Re: [DISCUSS] KIP-852 Optimize calculation of size for log in remote tier

Hey Jun

It has been a while since this KIP got some attention. While we wait for
Satish to chime in here, perhaps I can answer your question.

> Could you explain how you exposed the log size in your KIP-405
implementation?

The APIs available in RLMM as per KIP-405 are
addRemoteLogSegmentMetadata(), updateRemoteLogSegmentMetadata(),
remoteLogSegmentMetadata(), highestOffsetForEpoch(),
putRemotePartitionDeleteMetadata(), listRemoteLogSegments(),
onPartitionLeadershipChanges() and onStopPartitions(). None of these APIs
exposes the log size, so the only remaining option is to list all segments
using listRemoteLogSegments() and aggregate their sizes every time we need
to calculate the total. Based on our prior discussion, this requires reading
all segment metadata, which won't work for non-local RLMM implementations.
Satish's implementation also performs a full scan and calculates the
aggregate; see:
https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/remote/RemoteLogManager.scala#L619
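
To make the cost concrete, here is a minimal sketch of what such a full scan
amounts to. It is written in Python purely for illustration and is not taken
from any implementation: SegmentMetadata and InMemoryRlmm are hypothetical
stand-ins for RemoteLogSegmentMetadata and an RLMM plugin, and the
records_read counter exists only to show that one metadata record is read
per segment on every size calculation.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass(frozen=True)
class SegmentMetadata:
    """Hypothetical stand-in for KIP-405's RemoteLogSegmentMetadata."""
    start_offset: int
    segment_size_in_bytes: int


class InMemoryRlmm:
    """Toy metadata manager; a real RLMM may sit behind a network call."""

    def __init__(self, segments: Dict[str, List[SegmentMetadata]]) -> None:
        self._segments = segments
        self.records_read = 0  # number of metadata records handed to callers

    def list_remote_log_segments(self, topic_partition: str) -> Iterator[SegmentMetadata]:
        for metadata in self._segments.get(topic_partition, []):
            self.records_read += 1
            yield metadata


def remote_log_size_by_full_scan(rlmm: InMemoryRlmm, topic_partition: str) -> int:
    """Sum segment sizes by iterating over every metadata record: O(n) reads."""
    return sum(m.segment_size_in_bytes
               for m in rlmm.list_remote_log_segments(topic_partition))


rlmm = InMemoryRlmm({
    "orders-0": [
        SegmentMetadata(start_offset=0, segment_size_in_bytes=100),
        SegmentMetadata(start_offset=50, segment_size_in_bytes=200),
        SegmentMetadata(start_offset=120, segment_size_in_bytes=300),
    ]
})
total = remote_log_size_by_full_scan(rlmm, "orders-0")
print(total, rlmm.records_read)  # prints: 600 3
```

With a network-backed RLMM, each of those records would have to be fetched
remotely on every calculation, which is the cost a dedicated size API would
let the plugin avoid.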


Does this answer your question?

--
Divij Vaidya



On Tue, Dec 20, 2022 at 8:40 PM Jun Rao <ju...@confluent.io.invalid> wrote:

> Hi, Divij,
>
> Thanks for the explanation.
>
> Good question.
>
> Hi, Satish,
>
> Could you explain how you exposed the log size in your KIP-405
> implementation?
>
> Thanks,
>
> Jun
>
> On Tue, Dec 20, 2022 at 4:59 AM Divij Vaidya <di...@gmail.com>
> wrote:
>
> > Hey Jun
> >
> > Yes, it is possible to maintain the log size in the cache (see rejected
> > alternative #3 in the KIP), but I do not understand how it could be
> > retrieved without the new API. The log size could be calculated on
> > startup by scanning through the segments and maintained incrementally
> > afterwards (though I would disagree that this is the right approach,
> > since the scan itself takes on the order of minutes and hence delays the
> > start of the archive process). Even then, we would need an API in
> > RemoteLogMetadataManager so that RLM could fetch the cached size!
> >
> > If we wish to cache the size without adding a new API, then we need to
> > cache the size in RLM itself (instead of RLMM implementation) and
> > incrementally manage it. The downside of longer archive time at startup
> > (due to the initial scan) still remains valid in this situation.
> >
> > --
> > Divij Vaidya
> >
> >
> >
> > On Fri, Dec 16, 2022 at 12:43 AM Jun Rao <ju...@confluent.io.invalid>
> wrote:
> >
> > > Hi, Divij,
> > >
> > > Thanks for the explanation.
> > >
> > > If there is an in-memory cache, could we maintain the log size in the
> > > cache with the existing API? For example, a replica could make a
> > > listRemoteLogSegments(TopicIdPartition topicIdPartition) call on
> > > startup to get the remote segment size before the current leaderEpoch.
> > > The leader could then maintain the size incrementally afterwards. On
> > > leader change, other replicas can make a
> > > listRemoteLogSegments(TopicIdPartition topicIdPartition, int
> > > leaderEpoch) call to get the size of newly generated segments.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > >
> > > On Wed, Dec 14, 2022 at 3:27 AM Divij Vaidya <di...@gmail.com>
> > > wrote:
> > >
> > > > > Is the new method enough for doing size-based retention?
> > > >
> > > > Yes. You are right in assuming that this API only provides the remote
> > > > storage size (for the current epoch chain). We would use this API for
> > > > size-based retention along with a value of localOnlyLogSegmentSize,
> > > > which is computed as
> > > > Log.sizeInBytes(logSegments.filter(_.baseOffset >
> > > > highestOffsetWithRemoteIndex)). Hence, total_log_size =
> > > > remoteLogSizeBytes + log.localOnlyLogSegmentSize. I have updated the
> > > > KIP with this information. You can also check an example
> > > > implementation at
> > > > https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/Log.scala#L2077
> > > >
> > > >
> > > > > Do you imagine all accesses to remote metadata will be across the
> > > network
> > > > or will there be some local in-memory cache?
> > > >
> > > > I would expect a disk-less implementation to maintain a finite
> > > > in-memory cache for segment metadata to reduce the number of network
> > > > calls made to fetch the data. In the future, we can think about
> > > > bringing this finite-size cache into RLM itself, but that's probably
> > > > a conversation for a different KIP. There are many other things we
> > > > would like to do to optimize the tiered storage interface, such as
> > > > introducing a circular buffer / streaming interface from RSM (so that
> > > > we don't have to wait to fetch the entire segment before starting to
> > > > send records to the consumer), caching the segments fetched from RSM
> > > > locally (I would expect all RSM plugin implementations to do this, so
> > > > we might as well add it to RLM), etc.
> > > >
> > > > --
> > > > Divij Vaidya
> > > >
> > > >
> > > >
> > > > On Mon, Dec 12, 2022 at 7:35 PM Jun Rao <ju...@confluent.io.invalid>
> > > wrote:
> > > >
> > > > > Hi, Divij,
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > Is the new method enough for doing size-based retention? It gives
> the
> > > > total
> > > > > size of the remote segments, but it seems that we still don't know
> > the
> > > > > exact total size for a log since there could be overlapping
> segments
> > > > > between the remote and the local segments.
> > > > >
> > > > > You mentioned a disk-less implementation. Do you imagine all
> accesses
> > > to
> > > > > remote metadata will be across the network or will there be some
> > local
> > > > > in-memory cache?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Dec 7, 2022 at 3:10 AM Divij Vaidya <
> divijvaidya13@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > The method is needed for RLMM implementations which fetch the
> > > > information
> > > > > > over the network and not for the disk based implementations (such
> > as
> > > > the
> > > > > > default topic based RLMM).
> > > > > >
> > > > > > I would argue that adding this API makes the interface more
> generic
> > > > than
> > > > > > what it is today. This is because, with the current APIs an
> > > implementor
> > > > > is
> > > > > > restricted to use disk based RLMM solutions only (i.e. the
> default
> > > > > > solution) whereas if we add this new API, we unblock usage of
> > network
> > > > > based
> > > > > > RLMM implementations such as databases.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed 30. Nov 2022 at 20:40, Jun Rao <ju...@confluent.io.invalid>
> > > > wrote:
> > > > > >
> > > > > > > Hi, Divij,
> > > > > > >
> > > > > > > Thanks for the reply.
> > > > > > >
> > > > > > > Point#2. My high-level question is whether the new method is
> > > > > > > needed for every implementation of remote storage or just for a
> > > > > > > specific implementation. The issues that you pointed out exist
> > > > > > > for the default implementation of RLMM as well and, so far, the
> > > > > > > default implementation hasn't found a need for a similar new
> > > > > > > method. For a public interface, ideally we want to make it more
> > > > > > > general.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Mon, Nov 21, 2022 at 7:11 AM Divij Vaidya <
> > > > divijvaidya13@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Thank you Jun and Alex for your comments.
> > > > > > > >
> > > > > > > > Point#1: You are right Jun. As Alex mentioned, the "derived
> > > > metadata"
> > > > > > can
> > > > > > > > increase the size of cached metadata by a factor of 10 but it
> > > > should
> > > > > be
> > > > > > > ok
> > > > > > > > to cache just the actual metadata. My point about size being
> a
> > > > > > limitation
> > > > > > > > for using cache is not valid anymore.
> > > > > > > >
> > > > > > > > Point#2: For a new replica, it would still have to fetch the
> > > > metadata
> > > > > > > over
> > > > > > > > the network to initiate the warm up of the cache and hence,
> > > > increase
> > > > > > the
> > > > > > > > start time of the archival process. Please also note the
> > > > > repercussions
> > > > > > of
> > > > > > > > the warm up scan that Alex mentioned in this thread as part
> of
> > > > > #102.2.
> > > > > > > >
> > > > > > > > 100#: Agreed Alex. Thanks for clarifying that. My point about
> > > size
> > > > > > being
> > > > > > > a
> > > > > > > > limitation for using cache is not valid anymore.
> > > > > > > >
> > > > > > > > 101#: Alex, if I understand correctly, you are suggesting to
> > > cache
> > > > > the
> > > > > > > > total size at the leader and update it on archival. This
> > wouldn't
> > > > > work
> > > > > > > for
> > > > > > > > cases when the leader restarts where we would have to make a
> > full
> > > > > scan
> > > > > > > > to update the total size entry on startup. We expect users to
> > > store
> > > > > > data
> > > > > > > > over longer duration in remote storage which increases the
> > > > likelihood
> > > > > > of
> > > > > > > > leader restarts / failovers.
> > > > > > > >
> > > > > > > > 102#.1: I don't think that the current design accommodates
> the
> > > fact
> > > > > > that
> > > > > > > > data corruption could happen at the RLMM plugin (we don't
> have
> > > > > checksum
> > > > > > > as
> > > > > > > > a field in metadata as part of KIP405). If data corruption
> > > occurs,
> > > > w/
> > > > > > or
> > > > > > > > w/o the cache, it would be a different problem to solve. I
> > would
> > > > like
> > > > > > to
> > > > > > > > keep this outside the scope of this KIP.
> > > > > > > >
> > > > > > > > 102#.2: Agree. This remains as the main concern for using the
> > > cache
> > > > > to
> > > > > > > > fetch total size.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Divij Vaidya
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Nov 18, 2022 at 12:59 PM Alexandre Dupriez <
> > > > > > > > alexandre.dupriez@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Divij,
> > > > > > > > >
> > > > > > > > > Thanks for the KIP. Please find some comments based on
> what I
> > > > read
> > > > > on
> > > > > > > > > this thread so far - apologies for the repeats and the late
> > > > reply.
> > > > > > > > >
> > > > > > > > > If I understand correctly, one of the main elements of
> > > discussion
> > > > > is
> > > > > > > > > about caching in Kafka versus delegation of providing the
> > > remote
> > > > > size
> > > > > > > > > of a topic-partition to the plugin.
> > > > > > > > >
> > > > > > > > > A few comments:
> > > > > > > > >
> > > > > > > > > 100. The size of the “derived metadata” which is managed by
> > > > > > > > > the plugin to represent an rlmMetadata can indeed be close to
> > > > > > > > > 1 kB on average, depending on its own internal structure,
> > > > > > > > > e.g. the redundancy it enforces (unfortunately resulting in
> > > > > > > > > duplication) and additional information such as checksums
> > > > > > > > > and primary and secondary indexable keys. But indeed, the
> > > > > > > > > rlmMetadata is itself a lighter data structure by a factor
> > > > > > > > > of 10. And indeed, instead of caching the “derived metadata”,
> > > > > > > > > only the rlmMetadata could be cached, which should address
> > > > > > > > > the concern regarding the memory occupancy of the cache.
> > > > > > > > >
> > > > > > > > > 101. I am not sure I fully understand why we would need to
> > > cache
> > > > > the
> > > > > > > > > list of rlmMetadata to retain the remote size of a
> > > > topic-partition.
> > > > > > > > > Since the leader of a topic-partition is, in
> non-degenerated
> > > > cases,
> > > > > > > > > the only actor which can mutate the remote part of the
> > > > > > > > > topic-partition, hence its size, it could in theory only
> > cache
> > > > the
> > > > > > > > > size of the remote log once it has calculated it? In which
> > case
> > > > > there
> > > > > > > > > would not be any problem regarding the size of the caching
> > > > > strategy.
> > > > > > > > > Did I miss something there?
> > > > > > > > >
> > > > > > > > > 102. There may be a few challenges to consider with
> caching:
> > > > > > > > >
> > > > > > > > > 102.1) As mentioned above, the caching strategy assumes no
> > > > mutation
> > > > > > > > > outside the lifetime of a leader. While this is true in the
> > > > normal
> > > > > > > > > course of operation, there could be accidental mutation
> > outside
> > > > of
> > > > > > the
> > > > > > > > > leader and a loss of consistency between the cached state
> and
> > > the
> > > > > > > > > actual remote representation of the log. E.g. split-brain
> > > > > scenarios,
> > > > > > > > > bugs in the plugins, bugs in external systems with mutating
> > > > access
> > > > > on
> > > > > > > > > the derived metadata. In the worst case, a drift between
> the
> > > > cached
> > > > > > > > > size and the actual size could lead to over-deleting remote
> > > data
> > > > > > which
> > > > > > > > > is a durability risk.
> > > > > > > > >
> > > > > > > > > The alternative you propose, making the plugin the source of
> > > > > > > > > truth w.r.t. the size of the remote log, can make it easier
> > > > > > > > > to avoid inconsistencies between plugin-managed metadata and
> > > > > > > > > the remote log from the perspective of Kafka. On the other
> > > > > > > > > hand, plugin vendors would have to implement it with the
> > > > > > > > > expected efficiency for it to yield benefits.
> > > > > > > > >
> > > > > > > > > 102.2) As you mentioned, the caching strategy in Kafka
> would
> > > > still
> > > > > > > > > require one iteration over the list of rlmMetadata when the
> > > > > > leadership
> > > > > > > > > of a topic-partition is assigned to a broker, while the
> > plugin
> > > > can
> > > > > > > > > offer alternative constant-time approaches. This
> calculation
> > > > cannot
> > > > > > be
> > > > > > > > > put on the LeaderAndIsr path and would be performed in the
> > > > > > background.
> > > > > > > > > In case of bulk leadership migration, listing the
> rlmMetadata
> > > > could
> > > > > > a)
> > > > > > > > > result in request bursts to any backend system the plugin
> may
> > > use
> > > > > > > > > [which shouldn’t be a problem for high-throughput data
> stores
> > > but
> > > > > > > > > could have cost implications] b) increase utilisation
> > timespan
> > > of
> > > > > the
> > > > > > > > > RLM threads for these calculations potentially leading to
> > > > transient
> > > > > > > > > starvation of tasks queued for, typically, offloading
> > > operations
> > > > c)
> > > > > > > > > could have a non-marginal CPU footprint on hardware with
> > strict
> > > > > > > > > resource constraints. All these elements could have an
> impact
> > > to
> > > > > some
> > > > > > > > > degree depending on the operational environment.
> > > > > > > > >
> > > > > > > > > From a design perspective, one question is where we want
> the
> > > > source
> > > > > > of
> > > > > > > > > truth w.r.t. remote log size to be during the lifetime of a
> > > > leader.
> > > > > > > > > The responsibility of maintaining a consistent
> representation
> > > of
> > > > > the
> > > > > > > > > remote log is shared by Kafka and the plugin. Which system
> is
> > > > best
> > > > > > > > > placed to maintain such a state while providing the highest
> > > > > > > > > consistency guarantees is something both Kafka and plugin
> > > > designers
> > > > > > > > > could help understand better.
> > > > > > > > >
> > > > > > > > > Many thanks,
> > > > > > > > > Alexandre
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Le jeu. 17 nov. 2022 à 19:27, Jun Rao
> > <jun@confluent.io.invalid
> > > >
> > > > a
> > > > > > > > écrit :
> > > > > > > > > >
> > > > > > > > > > Hi, Divij,
> > > > > > > > > >
> > > > > > > > > > Thanks for the reply.
> > > > > > > > > >
> > > > > > > > > > Point #1. Is the average remote segment metadata really
> > 1KB?
> > > > > What's
> > > > > > > > > listed
> > > > > > > > > > in the public interface is probably well below 100 bytes.
> > > > > > > > > >
> > > > > > > > > > Point #2. I guess you are assuming that each broker only
> > > caches
> > > > > the
> > > > > > > > > remote
> > > > > > > > > > segment metadata in memory. An alternative approach is to
> > > cache
> > > > > > them
> > > > > > > in
> > > > > > > > > > both memory and local disk. That way, on broker restart,
> > you
> > > > just
> > > > > > > need
> > > > > > > > to
> > > > > > > > > > fetch the new remote segments' metadata using the
> > > > > > > > > > listRemoteLogSegments(TopicIdPartition topicIdPartition,
> > int
> > > > > > > > leaderEpoch)
> > > > > > > > > > api. Will that work?
> > > > > > > > > >
> > > > > > > > > > Point #3. Thanks for the explanation and it sounds good.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > > Jun
> > > > > > > > > >
> > > > > > > > > > On Thu, Nov 17, 2022 at 7:31 AM Divij Vaidya <
> > > > > > > divijvaidya13@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Jun
> > > > > > > > > > >
> > > > > > > > > > > There are three points that I would like to present
> here:
> > > > > > > > > > >
> > > > > > > > > > > 1. We would require a large cache size to efficiently
> > cache
> > > > all
> > > > > > > > segment
> > > > > > > > > > > metadata.
> > > > > > > > > > > 2. Linear scan of all metadata at broker startup to
> > > populate
> > > > > the
> > > > > > > > cache
> > > > > > > > > will
> > > > > > > > > > > be slow and will impact the archival process.
> > > > > > > > > > > 3. There is no other use case where a full scan of
> > segment
> > > > > > metadata
> > > > > > > > is
> > > > > > > > > > > required.
> > > > > > > > > > >
> > > > > > > > > > > Let's start by quantifying 1. Here's my estimate for the size
> > > > > > > > > > > of the cache.
> > > > > > > > > > > Average size of segment metadata = 1KB. This could be more if
> > > > > > > > > > > we have frequent leader failovers with a large number of
> > > > > > > > > > > leader epochs being stored per segment.
> > > > > > > > > > > Segment size = 100MB. Users will prefer to reduce the segment
> > > > > > > > > > > size from the default value of 1GB to ensure timely archival
> > > > > > > > > > > of data, since data from the active segment is not archived.
> > > > > > > > > > > Cache size = num segments * avg. segment metadata size =
> > > > > > > > > > > (100TB/100MB)*1KB = 1GB.
> > > > > > > > > > > While 1GB for a cache may not sound like a large number for
> > > > > > > > > > > larger machines, it does eat into the memory as an additional
> > > > > > > > > > > cache and makes use cases with large data retention and low
> > > > > > > > > > > throughput expensive (such use cases could otherwise use
> > > > > > > > > > > smaller machines).
> > > > > > > > > > >
> > > > > > > > > > > About point#2:
> > > > > > > > > > > Even if we say that all segment metadata can fit into the
> > > > > > > > > > > cache, we will need to populate the cache on broker startup.
> > > > > > > > > > > It would not be in the critical path of broker startup and
> > > > > > > > > > > hence won't impact the startup time. But it will impact the
> > > > > > > > > > > time when we could start the archival process, since the RLM
> > > > > > > > > > > thread pool will be blocked on the first call to
> > > > > > > > > > > listRemoteLogSegments(). Scanning metadata for 1MM segments
> > > > > > > > > > > (computed above) and transferring 1GB of data over the
> > > > > > > > > > > network from an RLMM such as a remote database would take on
> > > > > > > > > > > the order of minutes (depending on how efficient the scan is
> > > > > > > > > > > with the RLMM implementation). I would concede that having
> > > > > > > > > > > RLM threads blocked for a few minutes is perhaps OK, but if
> > > > > > > > > > > we introduce the new API proposed in the KIP, we would have a
> > > > > > > > > > > deterministic startup time for RLM. Adding the API comes at a
> > > > > > > > > > > low cost and I believe the trade-off is worth it.
> > > > > > > > > > >
> > > > > > > > > > > About point#3:
> > > > > > > > > > > We can use listRemoteLogSegments(TopicIdPartition
> > > > > > > > > > > topicIdPartition, int leaderEpoch) to calculate the segments
> > > > > > > > > > > eligible for deletion (based on size retention) where the
> > > > > > > > > > > leader epoch(s) belong to the current leader epoch chain. I
> > > > > > > > > > > understand that it may lead to segments belonging to other
> > > > > > > > > > > epoch lineages not getting deleted, and would require a
> > > > > > > > > > > separate mechanism to delete them. The separate mechanism
> > > > > > > > > > > would anyway be required to delete these "leaked" segments,
> > > > > > > > > > > as there are other cases which could lead to leaks, such as
> > > > > > > > > > > network problems with RSM midway through writing a segment,
> > > > > > > > > > > etc.
> > > > > > > > > > >
> > > > > > > > > > > Thank you for the replies so far. They have made me
> > > re-think
> > > > my
> > > > > > > > > assumptions
> > > > > > > > > > > and this dialogue has been very constructive for me.
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > > Divij Vaidya
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Nov 10, 2022 at 10:49 PM Jun Rao
> > > > > > <jun@confluent.io.invalid
> > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi, Divij,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the reply.
> > > > > > > > > > > >
> > > > > > > > > > > > It's true that the data in Kafka could be kept longer
> > > with
> > > > > > > KIP-405.
> > > > > > > > > How
> > > > > > > > > > > > much data do you envision to have per broker? For
> 100TB
> > > > data
> > > > > > per
> > > > > > > > > broker,
> > > > > > > > > > > > with 1GB segment and segment metadata of 100 bytes,
> it
> > > > > requires
> > > > > > > > > > > > 100TB/1GB*100 = 10MB, which should fit in memory.
> > > > > > > > > > > >
> > > > > > > > > > > > RemoteLogMetadataManager has two
> > listRemoteLogSegments()
> > > > > > methods.
> > > > > > > > > The one
> > > > > > > > > > > > you listed listRemoteLogSegments(TopicIdPartition
> > > > > > > topicIdPartition,
> > > > > > > > > int
> > > > > > > > > > > > leaderEpoch) does return data in offset order.
> However,
> > > the
> > > > > > other
> > > > > > > > > > > > one listRemoteLogSegments(TopicIdPartition
> > > > topicIdPartition)
> > > > > > > > doesn't
> > > > > > > > > > > > specify the return order. I assume that you need the
> > > latter
> > > > > to
> > > > > > > > > calculate
> > > > > > > > > > > > the segment size?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > >
> > > > > > > > > > > > Jun
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Nov 10, 2022 at 10:25 AM Divij Vaidya <
> > > > > > > > > divijvaidya13@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > *Jun,*
> > > > > > > > > > > > >
> > > > > > > > > > > > > *"the default implementation of RLMM does local
> > > caching,
> > > > > > > right?"*
> > > > > > > > > > > > > Yes, Jun. The default implementation of RLMM does
> > > indeed
> > > > > > cache
> > > > > > > > the
> > > > > > > > > > > > segment
> > > > > > > > > > > > > metadata today, hence, it won't work for use cases
> > when
> > > > the
> > > > > > > > number
> > > > > > > > > of
> > > > > > > > > > > > > segments in remote storage is large enough to
> exceed
> > > the
> > > > > size
> > > > > > > of
> > > > > > > > > cache.
> > > > > > > > > > > > As
> > > > > > > > > > > > > part of this KIP, I will implement the new proposed
> > API
> > > > in
> > > > > > the
> > > > > > > > > default
> > > > > > > > > > > > > implementation of RLMM but the underlying
> > > implementation
> > > > > will
> > > > > > > > > still be
> > > > > > > > > > > a
> > > > > > > > > > > > > scan. I will pick up optimizing that in a separate
> > PR.
> > > > > > > > > > > > >
> > > > > > > > > > > > > *"we also cache all segment metadata in the brokers
> > > > without
> > > > > > > > > KIP-405. Do
> > > > > > > > > > > > you
> > > > > > > > > > > > > see a need to change that?"*
> > > > > > > > > > > > > Please correct me if I am wrong here but we cache
> > > > metadata
> > > > > > for
> > > > > > > > > segments
> > > > > > > > > > > > > "residing in local storage". The size of the
> current
> > > > cache
> > > > > > > works
> > > > > > > > > fine
> > > > > > > > > > > for
> > > > > > > > > > > > > the scale of the number of segments that we expect
> to
> > > > store
> > > > > > in
> > > > > > > > > local
> > > > > > > > > > > > > storage. After KIP-405, that cache will continue to
> > > store
> > > > > > > > metadata
> > > > > > > > > for
> > > > > > > > > > > > > segments which are residing in local storage and
> > hence,
> > > > we
> > > > > > > don't
> > > > > > > > > need
> > > > > > > > > > > to
> > > > > > > > > > > > > change that. For segments which have been offloaded
> > to
> > > > > remote
> > > > > > > > > storage,
> > > > > > > > > > > it
> > > > > > > > > > > > > would rely on RLMM. Note that the scale of data
> > stored
> > > in
> > > > > > RLMM
> > > > > > > is
> > > > > > > > > > > > different
> > > > > > > > > > > > > from local cache because the number of segments is
> > > > expected
> > > > > > to
> > > > > > > be
> > > > > > > > > much
> > > > > > > > > > > > > larger than what current implementation stores in
> > local
> > > > > > > storage.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 2,3,4:
> > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > does
> > > > > > > > > specify
> > > > > > > > > > > the
> > > > > > > > > > > > > order i.e. it returns the segments sorted by first
> > > offset
> > > > > in
> > > > > > > > > ascending
> > > > > > > > > > > > > order. I am copying the API docs for KIP-405 here
> for
> > > > your
> > > > > > > > > reference
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > *Returns iterator of remote log segment metadata, sorted by
> > > > > > > > > > > > > {@link RemoteLogSegmentMetadata#startOffset()} in ascending
> > > > > > > > > > > > > order which contains the given leader epoch. This is used by
> > > > > > > > > > > > > remote log retention management subsystem to fetch the
> > > > > > > > > > > > > segment metadata for a given leader epoch.
> > > > > > > > > > > > > @param topicIdPartition topic partition
> > > > > > > > > > > > > @param leaderEpoch leader epoch
> > > > > > > > > > > > > @return Iterator of remote segments, sorted by start offset
> > > > > > > > > > > > > in ascending order.*
> > > > > > > > > > > > >
> > > > > > > > > > > > > *Luke,*
> > > > > > > > > > > > >
> > > > > > > > > > > > > 5. Note that we are trying to optimize the efficiency of
> > > > > > > > > > > > > size-based retention for remote storage. KIP-405 does not
> > > > > > > > > > > > > introduce a new config for remote storage, similar to
> > > > > > > > > > > > > log.retention.check.interval.ms, for periodically running
> > > > > > > > > > > > > the retention check. Hence, the metric will be updated at
> > > > > > > > > > > > > the time of invoking the log retention check for the remote
> > > > > > > > > > > > > tier, which is pending implementation today. We can perhaps
> > > > > > > > > > > > > come back and update the metric description after the
> > > > > > > > > > > > > implementation of the log retention check in
> > > > > > > > > > > > > RemoteLogManager.
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Nov 10, 2022 at 6:16 AM Luke Chen <
> > > > > showuon@gmail.com
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Divij,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > One more question about the metric:
> > > > > > > > > > > > > > I think the metric will be updated
> > > > > > > > > > > > > > (1) each time we run the log retention check (that is,
> > > > > > > > > > > > > > every log.retention.check.interval.ms)
> > > > > > > > > > > > > > (2) when a user explicitly calls getRemoteLogSize
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Is that correct?
> > > > > > > > > > > > > > Maybe we should add a note in the metric description;
> > > > > > > > > > > > > > otherwise, a user who sees, let's say, 0 for
> > > > > > > > > > > > > > RemoteLogSizeBytes will be surprised.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Otherwise, LGTM
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thank you for the KIP
> > > > > > > > > > > > > > Luke
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, Nov 10, 2022 at 2:55 AM Jun Rao
> > > > > > > > <jun@confluent.io.invalid
> > > > > > > > > >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi, Divij,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for the explanation.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 1. Hmm, the default implementation of RLMM does local caching, right?
> > > > > > > > > > > > > > > > > Currently, we also cache all segment metadata in the brokers without
> > > > > > > > > > > > > > > > > KIP-405. Do you see a need to change that?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 2,3,4: Yes, your explanation makes sense. However, currently,
> > > > > > > > > > > > > > > > > RemoteLogMetadataManager.listRemoteLogSegments() doesn't specify a
> > > > > > > > > > > > > > > > > particular order of the iterator. Do you intend to change that?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Jun
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 3:31 AM Divij Vaidya <divijvaidya13@gmail.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hey Jun
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thank you for your comments.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > *1. "RLMM implementor could ensure that listRemoteLogSegments() is fast"*
> > > > > > > > > > > > > > > > > > This would be ideal but, pragmatically, it is difficult to ensure that
> > > > > > > > > > > > > > > > > > listRemoteLogSegments() is fast. The possibility of a large number of
> > > > > > > > > > > > > > > > > > segments (much larger than what Kafka currently handles with local
> > > > > > > > > > > > > > > > > > storage today) would make it infeasible to adopt strategies such as
> > > > > > > > > > > > > > > > > > local caching to improve the performance of listRemoteLogSegments().
> > > > > > > > > > > > > > > > > > Apart from caching (which won't work due to size limitations), I can't
> > > > > > > > > > > > > > > > > > think of other strategies which may eliminate the need for IO operations
> > > > > > > > > > > > > > > > > > proportional to the total number of segments. Please advise if you have
> > > > > > > > > > > > > > > > > > something in mind.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 2.  "*If the size exceeds the retention size, we need to determine the
> > > > > > > > > > > > > > > > > > subset of segments to delete to bring the size within the retention
> > > > > > > > > > > > > > > > > > limit. Do we need to call
> > > > > > > > > > > > > > > > > > RemoteLogMetadataManager.listRemoteLogSegments() to determine that?"*
> > > > > > > > > > > > > > > > > > Yes, we need to call listRemoteLogSegments() to determine which segments
> > > > > > > > > > > > > > > > > > should be deleted. But there is a difference with the use case we are
> > > > > > > > > > > > > > > > > > trying to optimize with this KIP. To determine the subset of segments
> > > > > > > > > > > > > > > > > > which would be deleted, we only read metadata for segments which would
> > > > > > > > > > > > > > > > > > be deleted via the listRemoteLogSegments(). But to determine the
> > > > > > > > > > > > > > > > > > totalLogSize, which is required every time retention logic based on size
> > > > > > > > > > > > > > > > > > executes, we read metadata of *all* the segments in remote storage.
> > > > > > > > > > > > > > > > > > Hence, the number of results returned by
> > > > > > > > > > > > > > > > > > *RemoteLogMetadataManager.listRemoteLogSegments()* is different when we
> > > > > > > > > > > > > > > > > > are calculating totalLogSize vs. when we are determining the subset of
> > > > > > > > > > > > > > > > > > segments to delete.
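
A minimal sketch of the distinction drawn in point 2 above. Everything here is hypothetical: `Segment` is a stand-in for the per-segment metadata, and the deletion pass assumes segments arrive oldest-first (the thread notes that listRemoteLogSegments() does not currently specify an iteration order). Computing the total reads every segment, whereas picking the segments to delete stops as soon as enough bytes have been reclaimed.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; these are not Kafka classes.
final class SizeRetentionSketch {

    // Stand-in for the per-segment metadata an RLMM would return.
    static final class Segment {
        final long baseOffset;
        final long sizeInBytes;

        Segment(long baseOffset, long sizeInBytes) {
            this.baseOffset = baseOffset;
            this.sizeInBytes = sizeInBytes;
        }
    }

    // Full scan: every segment's metadata is read to obtain the total.
    static long totalSize(Iterator<Segment> allSegments) {
        long total = 0L;
        while (allSegments.hasNext()) {
            total += allSegments.next().sizeInBytes;
        }
        return total;
    }

    // Partial scan: consumes segments oldest-first and stops once the
    // remaining size fits within retentionBytes.
    static List<Segment> segmentsToDelete(Iterator<Segment> oldestFirst,
                                          long totalSize,
                                          long retentionBytes) {
        List<Segment> toDelete = new ArrayList<>();
        long remaining = totalSize;
        while (remaining > retentionBytes && oldestFirst.hasNext()) {
            Segment segment = oldestFirst.next();
            toDelete.add(segment);
            remaining -= segment.sizeInBytes;
        }
        return toDelete;
    }
}
```

If the total were served by a dedicated size API instead of `totalSize`, only the second, partial pass would need to touch segment metadata.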
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 3. *"Also, what about time-based retention? To make that efficient, do
> > > > > > > > > > > > > > > > > > we need to make some additional interface changes?"*
> > > > > > > > > > > > > > > > > > No. Note that time complexity to determine the segments for retention is
> > > > > > > > > > > > > > > > > > different for time based vs. size based. For time based, the time
> > > > > > > > > > > > > > > > > > complexity is a function of the number of segments which are "eligible
> > > > > > > > > > > > > > > > > > for deletion" (since we only read metadata for segments which would be
> > > > > > > > > > > > > > > > > > deleted) whereas in size based retention, the time complexity is a
> > > > > > > > > > > > > > > > > > function of "all segments" available in remote storage (metadata of all
> > > > > > > > > > > > > > > > > > segments needs to be read to calculate the total size). As you may
> > > > > > > > > > > > > > > > > > observe, this KIP will bring the time complexity for both time based
> > > > > > > > > > > > > > > > > > retention & size based retention to the same function.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 4. Also, please note that the new API introduced in this KIP also
> > > > > > > > > > > > > > > > > > enables us to provide a metric for total size of data stored in remote
> > > > > > > > > > > > > > > > > > storage. Without the API, calculation of this metric will become very
> > > > > > > > > > > > > > > > > > expensive with *listRemoteLogSegments().*
> > > > > > > > > > > > > > > > > > I understand that your motivation here is to avoid polluting the
> > > > > > > > > > > > > > > > > > interface with optimization specific APIs and I will agree with that
> > > > > > > > > > > > > > > > > > goal. But I believe that this new API proposed in the KIP brings in
> > > > > > > > > > > > > > > > > > significant improvement and there is no other workaround available to
> > > > > > > > > > > > > > > > > > achieve the same performance.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 12:12 AM Jun Rao <jun@confluent.io.invalid> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi, Divij,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thanks for the KIP. Sorry for the late reply.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > The motivation of the KIP is to improve the efficiency of size based
> > > > > > > > > > > > > > > > > > > retention. I am not sure the proposed changes are enough. For example,
> > > > > > > > > > > > > > > > > > > if the size exceeds the retention size, we need to determine the subset
> > > > > > > > > > > > > > > > > > > of segments to delete to bring the size within the retention limit. Do
> > > > > > > > > > > > > > > > > > > we need to call RemoteLogMetadataManager.listRemoteLogSegments() to
> > > > > > > > > > > > > > > > > > > determine that? Also, what about time-based retention? To make that
> > > > > > > > > > > > > > > > > > > efficient, do we need to make some additional interface changes?
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > An alternative approach is for the RLMM implementor to make sure that
> > > > > > > > > > > > > > > > > > > RemoteLogMetadataManager.listRemoteLogSegments() is fast (e.g., with
> > > > > > > > > > > > > > > > > > > local caching). This way, we could keep the interface simple. Have we
> > > > > > > > > > > > > > > > > > > considered that?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Jun
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Wed, Sep 28, 2022 at 6:28 AM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hey folks
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Does anyone else have any thoughts on this before I propose this for a
> > > > > > > > > > > > > > > > > > > > vote?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > On Mon, Sep 5, 2022 at 12:57 PM Satish Duggana <satish.duggana@gmail.com>
> > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thanks for the KIP Divij!
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > This is a nice improvement to avoid recalculation of size. Customized
> > > > > > > > > > > > > > > > > > > > > RLMMs can implement the best possible approach by caching or
> > > > > > > > > > > > > > > > > > > > > maintaining the size in an efficient way. But this is not a big concern
> > > > > > > > > > > > > > > > > > > > > for the default topic based RLMM as mentioned in the KIP.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > ~Satish.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > On Wed, 13 Jul 2022 at 18:48, Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Thank you for your review Luke.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Reg: would the new `RemoteLogSizeBytes` metric be a performance
> > > > > > > > > > > > > > > > > > > > > > > overhead? Although we move the calculation to a separate API, we still
> > > > > > > > > > > > > > > > > > > > > > > can't assume users will implement a light-weight method, right?
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > This metric would be logged using the information that is already being
> > > > > > > > > > > > > > > > > > > > > > calculated for handling remote retention logic, hence, no additional
> > > > > > > > > > > > > > > > > > > > > > work is required to calculate this metric. More specifically, whenever
> > > > > > > > > > > > > > > > > > > > > > RemoteLogManager calls the getRemoteLogSize API, this metric would be
> > > > > > > > > > > > > > > > > > > > > > captured. This API call is made every time RemoteLogManager wants to
> > > > > > > > > > > > > > > > > > > > > > handle expired remote log segments (which should be periodic). Does that
> > > > > > > > > > > > > > > > > > > > > > address your concern?
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 12, 2022 at 11:01 AM Luke Chen <showuon@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Hi Divij,
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Thanks for the KIP!
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > I think it makes sense to delegate the responsibility of calculation to
> > > > > > > > > > > > > > > > > > > > > > > the specific RemoteLogMetadataManager implementation.
> > > > > > > > > > > > > > > > > > > > > > > But one thing I'm not quite sure about: would the new
> > > > > > > > > > > > > > > > > > > > > > > `RemoteLogSizeBytes` metric be a performance overhead?
> > > > > > > > > > > > > > > > > > > > > > > Although we move the calculation to a separate API, we still can't
> > > > > > > > > > > > > > > > > > > > > > > assume users will implement a light-weight method, right?
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Thank you.
> > > > > > > > > > > > > > > > > > > > > Luke
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > On Fri, Jul 1, 2022 at 5:47 PM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-852%3A+Optimize+calculation+of+size+for+log+in+remote+tier
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Hey folks
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > Please take a look at this KIP which proposes an extension to KIP-405.
> > > > > > > > > > > > > > > > > > > > > > > > This is my first KIP with Apache Kafka community so any feedback would
> > > > > > > > > > > > > > > > > > > > > > > > be highly appreciated.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Cheers!
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > > > > > > > > > > > > > > Sr. Software Engineer
> > > > > > > > > > > > > > > > > > > > > > Amazon
> > > > > > > > > > > > > > > > > > > > > >

Re: [DISCUSS] KIP-852 Optimize calculation of size for log in remote tier

Posted by Luke Chen <sh...@gmail.com>.
Hi Divij,

One minor comment:
remoteLogSize takes 2 parameters, but in the code snippet, you only provide
1 parameter.

Otherwise, LGTM

Thank you.
Luke
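
A minimal sketch of what a two-parameter size method might look like. The parameter types are assumptions (a partition identifier and a leader epoch, following the listRemoteLogSegments(TopicIdPartition, int leaderEpoch) overload and the "current epoch chain" wording quoted further down this thread); the interface and class names are hypothetical and are not the Kafka API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; these are not the Kafka interfaces.
final class RemoteLogSizeSketch {

    // Assumed two-parameter shape: a partition identifier plus a leader epoch.
    interface MetadataManager {
        long remoteLogSize(String topicPartition, int leaderEpoch);
    }

    // Toy in-memory manager that keeps one running size per (partition, epoch).
    static final class InMemoryManager implements MetadataManager {
        private final Map<String, Long> sizes = new HashMap<>();

        void record(String topicPartition, int leaderEpoch, long bytes) {
            sizes.merge(topicPartition + "@" + leaderEpoch, bytes, Long::sum);
        }

        @Override
        public long remoteLogSize(String topicPartition, int leaderEpoch) {
            return sizes.getOrDefault(topicPartition + "@" + leaderEpoch, 0L);
        }
    }

    // The caller passes both arguments and sums over its own epoch chain.
    static long sizeForEpochChain(MetadataManager manager,
                                  String topicPartition,
                                  List<Integer> epochChain) {
        long total = 0L;
        for (int epoch : epochChain) {
            total += manager.remoteLogSize(topicPartition, epoch);
        }
        return total;
    }
}
```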

On Wed, Jul 12, 2023 at 8:56 PM Divij Vaidya <di...@gmail.com>
wrote:

> Jorge,
> About API name: Good point. I have changed it to remoteLogSize instead of
> getRemoteLogSize.
>
> About partition tag in the metric: We don't use partition tag across any of
> the RemoteStorage metrics and I would like to keep this metric aligned with
> the rest. I will change the metric though to type=BrokerTopicMetrics
> instead of type=RemoteLogManager, since this is topic level information and
> not specific to RemoteLogManager.
>
>
> Satish,
> Ah yes! Updated from "This would increase the broker start-up time." to
> "This would increase the bootstrap time for the remote storage thread pool
> before the first eligible segment is archived."
>
> --
> Divij Vaidya
>
>
>
> On Mon, Jul 3, 2023 at 2:07 PM Satish Duggana <sa...@gmail.com>
> wrote:
>
> > Thanks Divij for taking the feedback and updating the motivation
> > section in the KIP.
> >
> > One more comment on Alternative solution-3: the con is not valid, as
> > that will not affect the broker restart times, as discussed in the
> > earlier email in this thread. You may want to update that.
> >
> > ~Satish.
> >
> > On Sun, 2 Jul 2023 at 01:03, Divij Vaidya <di...@gmail.com> wrote:
> > >
> > > Thank you folks for reviewing this KIP.
> > >
> > > > Satish, I have modified the motivation to make it more clear. Now it
> > > > says,
> > > > "Since the main feature of tiered storage is storing a large amount of
> > > > data, we expect num_remote_segments to be large. A frequent linear scan
> > > > (i.e. listing all segment metadata) could be expensive/slower because of
> > > > the underlying storage used by RemoteLogMetadataManager. This slowness to
> > > > list all segment metadata could result in the loss of availability...."
> > >
> > > Jun, Kamal, Satish, if you don't have any further concerns, I would
> > > appreciate a vote for this KIP in the voting thread -
> > > https://lists.apache.org/thread/soz00990gvzodv7oyqj4ysvktrqy6xfk
> > >
> > > --
> > > Divij Vaidya
> > >
> > >
> > >
> > > On Sat, Jul 1, 2023 at 6:16 AM Kamal Chandraprakash <
> > > kamal.chandraprakash@gmail.com> wrote:
> > >
> > > > Hi Divij,
> > > >
> > > > Thanks for the explanation. LGTM.
> > > >
> > > > --
> > > > Kamal
> > > >
> > > > > On Sat, Jul 1, 2023 at 7:28 AM Satish Duggana <satish.duggana@gmail.com>
> > > > > wrote:
> > > >
> > > > > Hi Divij,
> > > > > > I am fine with having an API to compute the size as I mentioned in my
> > > > > > earlier reply in this mail thread. But I have the below comment for
> > > > > > the motivation for this KIP.
> > > > > >
> > > > > > As you discussed offline, the main issue here is that listing calls for
> > > > > > remote log segment metadata are slower because of the storage used for
> > > > > > RLMM. These can be avoided with this new API.
> > > > >
> > > > > Please add this in the motivation section as it is one of the main
> > > > > motivations for the KIP.
> > > > >
> > > > > Thanks,
> > > > > Satish.
> > > > >
> > > > > > On Sat, 1 Jul 2023 at 01:43, Jun Rao <ju...@confluent.io.invalid> wrote:
> > > > > >
> > > > > > Hi, Divij,
> > > > > >
> > > > > > Sorry for the late reply.
> > > > > >
> > > > > > > Given your explanation, the new API sounds reasonable to me. Is that
> > > > > > > enough to build the external metadata layer for the remote segments or
> > > > > > > do you need some additional API changes?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > > On Fri, Jun 9, 2023 at 7:08 AM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > wrote:
> > > > > >
> > > > > > > Thank you for looking into this Kamal.
> > > > > > >
> > > > > > > > You are right in saying that a cold start (i.e. leadership failover or
> > > > > > > > broker startup) does not impact the broker startup duration. But it does
> > > > > > > > have the following impact:
> > > > > > > > 1. It leads to a burst of full-scan requests to RLMM in case multiple
> > > > > > > > leadership failovers occur at the same time. Even if the RLMM
> > > > > > > > implementation has the capability to serve the total size from an index
> > > > > > > > (and hence handle this burst), we wouldn't be able to use it since the
> > > > > > > > current API necessarily calls for a full scan.
> > > > > > > > 2. The archival (copying of data to tiered storage) process will have a
> > > > > > > > delayed start. The delayed start of archival could lead to a local
> > > > > > > > build-up of data, which may fill the disk.
> > > > > > > >
> > > > > > > > The disadvantage of adding this new API is that every provider will have
> > > > > > > > to implement it, agreed. But I believe that this tradeoff is worthwhile
> > > > > > > > since the default implementation could be the same as you mentioned,
> > > > > > > > i.e. keeping a cumulative in-memory count.
> > > > > > >
> > > > > > > --
> > > > > > > Divij Vaidya
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Sun, Jun 4, 2023 at 5:48 PM Kamal Chandraprakash <
> > > > > > > kamal.chandraprakash@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Divij,
> > > > > > > >
> > > > > > > > Thanks for the KIP! Sorry for the late reply.
> > > > > > > >
> > > > > > > > > Can you explain the rejected alternative-3?
> > > > > > > > > Store the cumulative size of remote tier log in-memory at
> > > > > > > > > RemoteLogManager
> > > > > > > > > "*Cons*: Every time a broker starts-up, it will scan through all the
> > > > > > > > > segments in the remote tier to initialise the in-memory value. This
> > > > > > > > > would increase the broker start-up time."
> > > > > > > > >
> > > > > > > > > Keeping the source of truth to determine the remote-log-size in the
> > > > > > > > > leader would be consistent across different implementations of the
> > > > > > > > > plugin. The concern posted in the KIP is that we are calculating the
> > > > > > > > > remote-log-size on each iteration of the cleaner thread (say 5 mins). If
> > > > > > > > > we calculate only once during broker startup or during the leadership
> > > > > > > > > reassignment, do we still need the cache?
> > > > > > > > >
> > > > > > > > > The broker startup-time won't be affected by the remote log manager
> > > > > > > > > initialisation. The broker can start accepting new produce/fetch
> > > > > > > > > requests while the RLM thread in the background determines the
> > > > > > > > > remote-log-size once and starts copying/deleting the segments.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Kamal
> > > > > > > >
> > > > > > > > > On Thu, Jun 1, 2023 at 2:08 PM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Satish / Jun
> > > > > > > > >
> > > > > > > > > Do you have any thoughts on this?
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Divij Vaidya
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > On Tue, Feb 14, 2023 at 4:15 PM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hey Jun
> > > > > > > > > >
> > > > > > > > > > > It has been a while since this KIP got some attention. While we wait for
> > > > > > > > > > > Satish to chime in here, perhaps I can answer your question.
> > > > > > > > > > >
> > > > > > > > > > > > Could you explain how you exposed the log size in your KIP-405
> > > > > > > > > > > > implementation?
> > > > > > > > > >
> > > > > > > > > > > The APIs available in RLMM as per KIP-405 are
> > > > > > > > > > > addRemoteLogSegmentMetadata(), updateRemoteLogSegmentMetadata(),
> > > > > > > > > > > remoteLogSegmentMetadata(), highestOffsetForEpoch(),
> > > > > > > > > > > putRemotePartitionDeleteMetadata(), listRemoteLogSegments(),
> > > > > > > > > > > onPartitionLeadershipChanges() and onStopPartitions(). None of these
> > > > > > > > > > > APIs allow us to expose the log size, hence, the only option that
> > > > > > > > > > > remains is to list all segments using listRemoteLogSegments() and
> > > > > > > > > > > aggregate them every time we need to calculate the size. Based on our
> > > > > > > > > > > prior discussion, this requires reading all segment metadata which
> > > > > > > > > > > won't work for non-local RLMM implementations. Satish's implementation
> > > > > > > > > > > also performs a full scan and calculates the aggregate. See:
> > > > > > > > > > > https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/remote/RemoteLogManager.scala#L619
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Does this answer your question?
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Divij Vaidya
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > On Tue, Dec 20, 2022 at 8:40 PM Jun Rao <jun@confluent.io.invalid>
> > > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >> Hi, Divij,
> > > > > > > > > >>
> > > > > > > > > >> Thanks for the explanation.
> > > > > > > > > >>
> > > > > > > > > >> Good question.
> > > > > > > > > >>
> > > > > > > > > >> Hi, Satish,
> > > > > > > > > >>
> > > > > > > > > > >> Could you explain how you exposed the log size in your KIP-405
> > > > > > > > > > >> implementation?
> > > > > > > > > >>
> > > > > > > > > >> Thanks,
> > > > > > > > > >>
> > > > > > > > > >> Jun
> > > > > > > > > >>
> > > > > > > > > > >> On Tue, Dec 20, 2022 at 4:59 AM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > > > > >> wrote:
> > > > > > > > > >>
> > > > > > > > > >> > Hey Jun
> > > > > > > > > >> >
> > > > > > > > > > >> > Yes, it is possible to maintain the log size in the cache (see rejected
> > > > > > > > > > >> > alternative#3 in the KIP) but I did not understand how it is possible to
> > > > > > > > > > >> > retrieve it without the new API. The log size could be calculated on
> > > > > > > > > > >> > startup by scanning through the segments (though I would disagree that
> > > > > > > > > > >> > this is the right approach since scanning itself takes order of minutes
> > > > > > > > > > >> > and hence delays the start of the archive process), and incrementally
> > > > > > > > > > >> > maintained afterwards; even then, we would need an API in
> > > > > > > > > > >> > RemoteLogMetadataManager so that RLM could fetch the cached size!
> > > > > > > > > > >> >
> > > > > > > > > > >> > If we wish to cache the size without adding a new API, then we need to
> > > > > > > > > > >> > cache the size in RLM itself (instead of RLMM implementation) and
> > > > > > > > > > >> > incrementally manage it. The downside of longer archive time at startup
> > > > > > > > > > >> > (due to initial scale) still remains valid in this situation.
> > > > > > > > > >> >
> > > > > > > > > >> > --
> > > > > > > > > >> > Divij Vaidya
> > > > > > > > > >> >
> > > > > > > > > >> >
> > > > > > > > > >> >
> > > > > > > > > > >> > On Fri, Dec 16, 2022 at 12:43 AM Jun Rao <jun@confluent.io.invalid>
> > > > > > > > > > >> > wrote:
> > > > > > > > > >> >
> > > > > > > > > >> > > Hi, Divij,
> > > > > > > > > >> > >
> > > > > > > > > >> > > Thanks for the explanation.
> > > > > > > > > >> > >
> > > > > > > > > >> > > If there is in-memory cache, could we maintain the
> log
> > > > size
> > > > > in
> > > > > > > the
> > > > > > > > > >> cache
> > > > > > > > > >> > > with the existing API? For example, a replica could
> > make a
> > > > > > > > > >> > > listRemoteLogSegments(TopicIdPartition
> > topicIdPartition)
> > > > > call on
> > > > > > > > > >> startup
> > > > > > > > > >> > to
> > > > > > > > > >> > > get the remote segment size before the current
> > > > leaderEpoch.
> > > > > The
> > > > > > > > > leader
> > > > > > > > > >> > > could then maintain the size incrementally
> > afterwards. On
> > > > > leader
> > > > > > > > > >> change,
> > > > > > > > > >> > > other replicas can make a
> > > > > listRemoteLogSegments(TopicIdPartition
> > > > > > > > > >> > > topicIdPartition, int leaderEpoch) call to get the
> > size of
> > > > > newly
> > > > > > > > > >> > generated
> > > > > > > > > >> > > segments.
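
The incremental approach described in the paragraph above can be sketched roughly as follows. This is an illustrative model only, not code from Kafka or from the KIP: the class and method names (RemoteSizeTracker, on_segment_copied, on_segment_deleted) are hypothetical, and the initial list of sizes stands in for whatever a one-off listRemoteLogSegments() scan would return.

```python
class RemoteSizeTracker:
    """Hypothetical helper that keeps a running total of remote segment bytes."""

    def __init__(self, initial_segment_sizes):
        # One-off full listing on startup; stands in for a
        # listRemoteLogSegments(topicIdPartition) scan.
        self.total_bytes = sum(initial_segment_sizes)

    def on_segment_copied(self, size_bytes):
        # Called after a segment has been uploaded to remote storage.
        self.total_bytes += size_bytes

    def on_segment_deleted(self, size_bytes):
        # Called after a remote segment has been removed by retention.
        self.total_bytes -= size_bytes


tracker = RemoteSizeTracker([100, 100, 50])  # sizes found by the initial scan
tracker.on_segment_copied(75)                # a new segment is archived
tracker.on_segment_deleted(100)              # an old segment is expired
print(tracker.total_bytes)
```

The sketch shows why the startup scan is the only expensive step: every later update is constant-time, but the running total is only as trustworthy as the events that feed it.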
> > > > > > > > > >> > >
> > > > > > > > > >> > > Thanks,
> > > > > > > > > >> > >
> > > > > > > > > >> > > Jun
> > > > > > > > > >> > >
> > > > > > > > > >> > >
> > > > > > > > > >> > > On Wed, Dec 14, 2022 at 3:27 AM Divij Vaidya <
> > > > > > > > > divijvaidya13@gmail.com
> > > > > > > > > >> >
> > > > > > > > > >> > > wrote:
> > > > > > > > > >> > >
> > > > > > > > > >> > > > > Is the new method enough for doing size-based
> > > > retention?
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > Yes. You are right in assuming that this API only
> > > > > provides the
> > > > > > > > > >> Remote
> > > > > > > > > >> > > > storage size (for current epoch chain). We would
> use
> > > > this
> > > > > API
> > > > > > > > for
> > > > > > > > > >> size
> > > > > > > > > >> > > > based retention along with a value of
> > > > > localOnlyLogSegmentSize
> > > > > > > > > which
> > > > > > > > > >> is
> > > > > > > > > >> > > > computed as
> > > > > Log.sizeInBytes(logSegments.filter(_.baseOffset >
> > > > > > > > > >> > > > highestOffsetWithRemoteIndex)). Hence,
> > (total_log_size =
> > > > > > > > > >> > > > remoteLogSizeBytes +
> log.localOnlyLogSegmentSize). I
> > > > have
> > > > > > > > updated
> > > > > > > > > >> the
> > > > > > > > > >> > KIP
> > > > > > > > > >> > > > with this information. You can also check an
> example
> > > > > > > > > implementation
> > > > > > > > > >> at
> > > > > > > > > >> > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> >
> https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/Log.scala#L2077
> > > > > > > > > >> > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > Do you imagine all accesses to remote metadata
> > will be
> > > > > > > across
> > > > > > > > > the
> > > > > > > > > >> > > network
> > > > > > > > > >> > > > or will there be some local in-memory cache?
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > I would expect a disk-less implementation to
> > maintain a
> > > > > finite
> > > > > > > > > >> > in-memory
> > > > > > > > > >> > > > cache for segment metadata to optimize the number
> of
> > > > > network
> > > > > > > > calls
> > > > > > > > > >> made
> > > > > > > > > >> > > to
> > > > > > > > > >> > > > fetch the data. In future, we can think about
> > bringing
> > > > > this
> > > > > > > > finite
> > > > > > > > > >> size
> > > > > > > > > >> > > > cache into RLM itself but that's probably a
> > conversation
> > > > > for a
> > > > > > > > > >> > different
> > > > > > > > > >> > > > KIP. There are many other things we would like to
> > do to
> > > > > > > optimize
> > > > > > > > > the
> > > > > > > > > >> > > Tiered
> > > > > > > > > >> > > > storage interface such as introducing a circular
> > buffer
> > > > /
> > > > > > > > > streaming
> > > > > > > > > >> > > > interface from RSM (so that we don't have to wait
> to
> > > > > fetch the
> > > > > > > > > >> entire
> > > > > > > > > >> > > > segment before starting to send records to the
> > > > consumer),
> > > > > > > > caching
> > > > > > > > > >> the
> > > > > > > > > >> > > > segments fetched from RSM locally (I would assume
> > all
> > > > RSM
> > > > > > > plugin
> > > > > > > > > >> > > > implementations to do this, might as well add it
> to
> > RLM)
> > > > > etc.
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > --
> > > > > > > > > >> > > > Divij Vaidya
> > > > > > > > > >> > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > On Mon, Dec 12, 2022 at 7:35 PM Jun Rao
> > > > > > > > <jun@confluent.io.invalid
> > > > > > > > > >
> > > > > > > > > >> > > wrote:
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > Hi, Divij,
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > Thanks for the reply.
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > Is the new method enough for doing size-based
> > > > > retention? It
> > > > > > > > > gives
> > > > > > > > > >> the
> > > > > > > > > >> > > > total
> > > > > > > > > >> > > > > size of the remote segments, but it seems that
> we
> > > > still
> > > > > > > don't
> > > > > > > > > know
> > > > > > > > > >> > the
> > > > > > > > > >> > > > > exact total size for a log since there could be
> > > > > overlapping
> > > > > > > > > >> segments
> > > > > > > > > >> > > > > between the remote and the local segments.
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > You mentioned a disk-less implementation. Do you
> > > > > imagine all
> > > > > > > > > >> accesses
> > > > > > > > > >> > > to
> > > > > > > > > >> > > > > remote metadata will be across the network or
> will
> > > > > there be
> > > > > > > > some
> > > > > > > > > >> > local
> > > > > > > > > >> > > > > in-memory cache?
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > Thanks,
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > Jun
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > On Wed, Dec 7, 2022 at 3:10 AM Divij Vaidya <
> > > > > > > > > >> divijvaidya13@gmail.com
> > > > > > > > > >> > >
> > > > > > > > > >> > > > > wrote:
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > > The method is needed for RLMM implementations
> > which
> > > > > fetch
> > > > > > > > the
> > > > > > > > > >> > > > information
> > > > > > > > > >> > > > > > over the network and not for the disk based
> > > > > > > implementations
> > > > > > > > > >> (such
> > > > > > > > > >> > as
> > > > > > > > > >> > > > the
> > > > > > > > > >> > > > > > default topic based RLMM).
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > I would argue that adding this API makes the
> > > > interface
> > > > > > > more
> > > > > > > > > >> generic
> > > > > > > > > >> > > > than
> > > > > > > > > >> > > > > > what it is today. This is because, with the
> > current
> > > > > APIs
> > > > > > > an
> > > > > > > > > >> > > implementor
> > > > > > > > > >> > > > > is
> > > > > > > > > >> > > > > > restricted to use disk based RLMM solutions
> only
> > > > > (i.e. the
> > > > > > > > > >> default
> > > > > > > > > >> > > > > > solution) whereas if we add this new API, we
> > unblock
> > > > > usage
> > > > > > > > of
> > > > > > > > > >> > network
> > > > > > > > > >> > > > > based
> > > > > > > > > >> > > > > > RLMM implementations such as databases.
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > On Wed 30. Nov 2022 at 20:40, Jun Rao
> > > > > > > > > <jun@confluent.io.invalid
> > > > > > > > > >> >
> > > > > > > > > >> > > > wrote:
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > > Hi, Divij,
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > Thanks for the reply.
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > Point#2. My high level question is that is
> > the new
> > > > > > > method
> > > > > > > > > >> needed
> > > > > > > > > >> > > for
> > > > > > > > > >> > > > > > every
> > > > > > > > > >> > > > > > > implementation of remote storage or just
> for a
> > > > > specific
> > > > > > > > > >> > > > implementation.
> > > > > > > > > >> > > > > > The
> > > > > > > > > >> > > > > > > issues that you pointed out exist for the
> > default
> > > > > > > > > >> implementation
> > > > > > > > > >> > of
> > > > > > > > > >> > > > > RLMM
> > > > > > > > > >> > > > > > as
> > > > > > > > > >> > > > > > > well and so far, the default implementation
> > hasn't
> > > > > > > found a
> > > > > > > > > >> need
> > > > > > > > > >> > > for a
> > > > > > > > > >> > > > > > > similar new method. For public interface,
> > ideally
> > > > we
> > > > > > > want
> > > > > > > > to
> > > > > > > > > >> make
> > > > > > > > > >> > > it
> > > > > > > > > >> > > > > more
> > > > > > > > > >> > > > > > > general.
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > Thanks,
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > Jun
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > On Mon, Nov 21, 2022 at 7:11 AM Divij
> Vaidya <
> > > > > > > > > >> > > > divijvaidya13@gmail.com>
> > > > > > > > > >> > > > > > > wrote:
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > > Thank you Jun and Alex for your comments.
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > Point#1: You are right Jun. As Alex
> > mentioned,
> > > > the
> > > > > > > > > "derived
> > > > > > > > > >> > > > metadata"
> > > > > > > > > >> > > > > > can
> > > > > > > > > >> > > > > > > > increase the size of cached metadata by a
> > factor
> > > > > of 10
> > > > > > > > but
> > > > > > > > > >> it
> > > > > > > > > >> > > > should
> > > > > > > > > >> > > > > be
> > > > > > > > > >> > > > > > > ok
> > > > > > > > > >> > > > > > > > to cache just the actual metadata. My
> point
> > > > about
> > > > > size
> > > > > > > > > >> being a
> > > > > > > > > >> > > > > > limitation
> > > > > > > > > >> > > > > > > > for using cache is not valid anymore.
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > Point#2: For a new replica, it would still
> > have
> > > > to
> > > > > > > fetch
> > > > > > > > > the
> > > > > > > > > >> > > > metadata
> > > > > > > > > >> > > > > > > over
> > > > > > > > > >> > > > > > > > the network to initiate the warm up of the
> > cache
> > > > > and
> > > > > > > > > hence,
> > > > > > > > > >> > > > increase
> > > > > > > > > >> > > > > > the
> > > > > > > > > >> > > > > > > > start time of the archival process. Please
> > also
> > > > > note
> > > > > > > the
> > > > > > > > > >> > > > > repercussions
> > > > > > > > > >> > > > > > of
> > > > > > > > > >> > > > > > > > the warm up scan that Alex mentioned in
> this
> > > > > thread as
> > > > > > > > > part
> > > > > > > > > >> of
> > > > > > > > > >> > > > > #102.2.
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > 100#: Agreed Alex. Thanks for clarifying
> > that.
> > > > My
> > > > > > > point
> > > > > > > > > >> about
> > > > > > > > > >> > > size
> > > > > > > > > >> > > > > > being
> > > > > > > > > >> > > > > > > a
> > > > > > > > > >> > > > > > > > limitation for using cache is not valid
> > anymore.
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > 101#: Alex, if I understand correctly, you
> > are
> > > > > > > > suggesting
> > > > > > > > > to
> > > > > > > > > >> > > cache
> > > > > > > > > >> > > > > the
> > > > > > > > > >> > > > > > > > total size at the leader and update it on
> > > > > archival.
> > > > > > > This
> > > > > > > > > >> > wouldn't
> > > > > > > > > >> > > > > work
> > > > > > > > > >> > > > > > > for
> > > > > > > > > >> > > > > > > > cases when the leader restarts where we
> > would
> > > > > have to
> > > > > > > > > make a
> > > > > > > > > >> > full
> > > > > > > > > >> > > > > scan
> > > > > > > > > >> > > > > > > > to update the total size entry on startup.
> > We
> > > > > expect
> > > > > > > > users
> > > > > > > > > >> to
> > > > > > > > > >> > > store
> > > > > > > > > >> > > > > > data
> > > > > > > > > >> > > > > > > > over longer duration in remote storage
> which
> > > > > increases
> > > > > > > > the
> > > > > > > > > >> > > > likelihood
> > > > > > > > > >> > > > > > of
> > > > > > > > > >> > > > > > > > leader restarts / failovers.
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > 102#.1: I don't think that the current
> > design
> > > > > > > > accommodates
> > > > > > > > > >> the
> > > > > > > > > >> > > fact
> > > > > > > > > >> > > > > > that
> > > > > > > > > >> > > > > > > > data corruption could happen at the RLMM
> > plugin
> > > > > (we
> > > > > > > > don't
> > > > > > > > > >> have
> > > > > > > > > >> > > > > checksum
> > > > > > > > > >> > > > > > > as
> > > > > > > > > >> > > > > > > > a field in metadata as part of KIP405). If
> > data
> > > > > > > > corruption
> > > > > > > > > >> > > occurs,
> > > > > > > > > >> > > > w/
> > > > > > > > > >> > > > > > or
> > > > > > > > > >> > > > > > > > w/o the cache, it would be a different
> > problem
> > > > to
> > > > > > > > solve. I
> > > > > > > > > >> > would
> > > > > > > > > >> > > > like
> > > > > > > > > >> > > > > > to
> > > > > > > > > >> > > > > > > > keep this outside the scope of this KIP.
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > 102#.2: Agree. This remains as the main
> > concern
> > > > > for
> > > > > > > > using
> > > > > > > > > >> the
> > > > > > > > > >> > > cache
> > > > > > > > > >> > > > > to
> > > > > > > > > >> > > > > > > > fetch total size.
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > Regards,
> > > > > > > > > >> > > > > > > > Divij Vaidya
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > On Fri, Nov 18, 2022 at 12:59 PM Alexandre
> > > > > Dupriez <
> > > > > > > > > >> > > > > > > > alexandre.dupriez@gmail.com> wrote:
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > > Hi Divij,
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > Thanks for the KIP. Please find some
> > comments
> > > > > based
> > > > > > > on
> > > > > > > > > >> what I
> > > > > > > > > >> > > > read
> > > > > > > > > >> > > > > on
> > > > > > > > > >> > > > > > > > > this thread so far - apologies for the
> > repeats
> > > > > and
> > > > > > > the
> > > > > > > > > >> late
> > > > > > > > > >> > > > reply.
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > If I understand correctly, one of the
> main
> > > > > elements
> > > > > > > of
> > > > > > > > > >> > > discussion
> > > > > > > > > >> > > > > is
> > > > > > > > > >> > > > > > > > > about caching in Kafka versus delegation
> > of
> > > > > > > providing
> > > > > > > > > the
> > > > > > > > > >> > > remote
> > > > > > > > > >> > > > > size
> > > > > > > > > >> > > > > > > > > of a topic-partition to the plugin.
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > A few comments:
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > 100. The size of the “derived metadata”
> > which
> > > > is
> > > > > > > > managed
> > > > > > > > > >> by
> > > > > > > > > >> > the
> > > > > > > > > >> > > > > > plugin
> > > > > > > > > >> > > > > > > > > to represent an rlmMetadata can indeed
> be
> > > > close
> > > > > to 1
> > > > > > > > kB
> > > > > > > > > on
> > > > > > > > > >> > > > average
> > > > > > > > > >> > > > > > > > > depending on its own internal structure,
> > e.g.
> > > > > the
> > > > > > > > > >> redundancy
> > > > > > > > > >> > it
> > > > > > > > > > >> > > > > > > > > enforces (unfortunately resulting in
> > > > > duplication),
> > > > > > > > > >> additional
> > > > > > > > > >> > > > > > > > > information such as checksums and
> primary
> > and
> > > > > > > > secondary
> > > > > > > > > >> > > indexable
> > > > > > > > > >> > > > > > > > > keys. But indeed, the rlmMetadata is
> > itself a
> > > > > > > lighter
> > > > > > > > > data
> > > > > > > > > >> > > > > structure
> > > > > > > > > >> > > > > > > > > by a factor of 10. And indeed, instead
> of
> > > > > caching
> > > > > > > the
> > > > > > > > > >> > “derived
> > > > > > > > > >> > > > > > > > > metadata”, only the rlmMetadata could
> be,
> > > > which
> > > > > > > should
> > > > > > > > > >> > address
> > > > > > > > > >> > > > the
> > > > > > > > > >> > > > > > > > > concern regarding the memory occupancy
> of
> > the
> > > > > cache.
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > 101. I am not sure I fully understand
> why
> > we
> > > > > would
> > > > > > > > need
> > > > > > > > > to
> > > > > > > > > >> > > cache
> > > > > > > > > >> > > > > the
> > > > > > > > > >> > > > > > > > > list of rlmMetadata to retain the remote
> > size
> > > > > of a
> > > > > > > > > >> > > > topic-partition.
> > > > > > > > > >> > > > > > > > > Since the leader of a topic-partition
> is,
> > in
> > > > > > > > > >> non-degenerated
> > > > > > > > > >> > > > cases,
> > > > > > > > > >> > > > > > > > > the only actor which can mutate the
> remote
> > > > part
> > > > > of
> > > > > > > the
> > > > > > > > > >> > > > > > > > > topic-partition, hence its size, it
> could
> > in
> > > > > theory
> > > > > > > > only
> > > > > > > > > >> > cache
> > > > > > > > > >> > > > the
> > > > > > > > > >> > > > > > > > > size of the remote log once it has
> > calculated
> > > > > it? In
> > > > > > > > > which
> > > > > > > > > >> > case
> > > > > > > > > >> > > > > there
> > > > > > > > > >> > > > > > > > > would not be any problem regarding the
> > size of
> > > > > the
> > > > > > > > > caching
> > > > > > > > > >> > > > > strategy.
> > > > > > > > > >> > > > > > > > > Did I miss something there?
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > 102. There may be a few challenges to
> > consider
> > > > > with
> > > > > > > > > >> caching:
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > 102.1) As mentioned above, the caching
> > > > strategy
> > > > > > > > assumes
> > > > > > > > > no
> > > > > > > > > >> > > > mutation
> > > > > > > > > >> > > > > > > > > outside the lifetime of a leader. While
> > this
> > > > is
> > > > > true
> > > > > > > > in
> > > > > > > > > >> the
> > > > > > > > > >> > > > normal
> > > > > > > > > >> > > > > > > > > course of operation, there could be
> > accidental
> > > > > > > > mutation
> > > > > > > > > >> > outside
> > > > > > > > > >> > > > of
> > > > > > > > > >> > > > > > the
> > > > > > > > > >> > > > > > > > > leader and a loss of consistency between
> > the
> > > > > cached
> > > > > > > > > state
> > > > > > > > > >> and
> > > > > > > > > >> > > the
> > > > > > > > > >> > > > > > > > > actual remote representation of the log.
> > E.g.
> > > > > > > > > split-brain
> > > > > > > > > >> > > > > scenarios,
> > > > > > > > > >> > > > > > > > > bugs in the plugins, bugs in external
> > systems
> > > > > with
> > > > > > > > > >> mutating
> > > > > > > > > >> > > > access
> > > > > > > > > >> > > > > on
> > > > > > > > > >> > > > > > > > > the derived metadata. In the worst
> case, a
> > > > drift
> > > > > > > > between
> > > > > > > > > >> the
> > > > > > > > > >> > > > cached
> > > > > > > > > >> > > > > > > > > size and the actual size could lead to
> > > > > over-deleting
> > > > > > > > > >> remote
> > > > > > > > > >> > > data
> > > > > > > > > >> > > > > > which
> > > > > > > > > >> > > > > > > > > is a durability risk.
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > The alternative you propose, by making
> the
> > > > > plugin
> > > > > > > the
> > > > > > > > > >> source
> > > > > > > > > >> > of
> > > > > > > > > >> > > > > truth
> > > > > > > > > > >> > > > > > > > > w.r.t. the size of the remote log,
> can
> > make
> > > > > it
> > > > > > > > easier
> > > > > > > > > >> to
> > > > > > > > > >> > > avoid
> > > > > > > > > >> > > > > > > > > inconsistencies between plugin-managed
> > > > metadata
> > > > > and
> > > > > > > > the
> > > > > > > > > >> > remote
> > > > > > > > > >> > > > log
> > > > > > > > > >> > > > > > > > > from the perspective of Kafka. On the
> > other
> > > > > hand,
> > > > > > > > plugin
> > > > > > > > > >> > > vendors
> > > > > > > > > >> > > > > > would
> > > > > > > > > >> > > > > > > > > have to implement it with the expected
> > > > > efficiency to
> > > > > > > > > have
> > > > > > > > > >> it
> > > > > > > > > >> > > > yield
> > > > > > > > > >> > > > > > > > > benefits.
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > 102.2) As you mentioned, the caching
> > strategy
> > > > in
> > > > > > > Kafka
> > > > > > > > > >> would
> > > > > > > > > >> > > > still
> > > > > > > > > >> > > > > > > > > require one iteration over the list of
> > > > > rlmMetadata
> > > > > > > > when
> > > > > > > > > >> the
> > > > > > > > > >> > > > > > leadership
> > > > > > > > > >> > > > > > > > > of a topic-partition is assigned to a
> > broker,
> > > > > while
> > > > > > > > the
> > > > > > > > > >> > plugin
> > > > > > > > > >> > > > can
> > > > > > > > > >> > > > > > > > > offer alternative constant-time
> > approaches.
> > > > This
> > > > > > > > > >> calculation
> > > > > > > > > >> > > > cannot
> > > > > > > > > >> > > > > > be
> > > > > > > > > >> > > > > > > > > put on the LeaderAndIsr path and would
> be
> > > > > performed
> > > > > > > in
> > > > > > > > > the
> > > > > > > > > >> > > > > > background.
> > > > > > > > > >> > > > > > > > > In case of bulk leadership migration,
> > listing
> > > > > the
> > > > > > > > > >> rlmMetadata
> > > > > > > > > >> > > > could
> > > > > > > > > >> > > > > > a)
> > > > > > > > > >> > > > > > > > > result in request bursts to any backend
> > system
> > > > > the
> > > > > > > > > plugin
> > > > > > > > > >> may
> > > > > > > > > >> > > use
> > > > > > > > > >> > > > > > > > > [which shouldn’t be a problem for
> > > > > high-throughput
> > > > > > > data
> > > > > > > > > >> stores
> > > > > > > > > >> > > but
> > > > > > > > > >> > > > > > > > > could have cost implications] b)
> increase
> > > > > > > utilisation
> > > > > > > > > >> > timespan
> > > > > > > > > >> > > of
> > > > > > > > > >> > > > > the
> > > > > > > > > >> > > > > > > > > RLM threads for these calculations
> > potentially
> > > > > > > leading
> > > > > > > > > to
> > > > > > > > > >> > > > transient
> > > > > > > > > >> > > > > > > > > starvation of tasks queued for,
> typically,
> > > > > > > offloading
> > > > > > > > > >> > > operations
> > > > > > > > > >> > > > c)
> > > > > > > > > >> > > > > > > > > could have a non-marginal CPU footprint
> on
> > > > > hardware
> > > > > > > > with
> > > > > > > > > >> > strict
> > > > > > > > > >> > > > > > > > > resource constraints. All these elements
> > could
> > > > > have
> > > > > > > an
> > > > > > > > > >> impact
> > > > > > > > > >> > > to
> > > > > > > > > >> > > > > some
> > > > > > > > > >> > > > > > > > > degree depending on the operational
> > > > environment.
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > From a design perspective, one question
> is
> > > > > where we
> > > > > > > > want
> > > > > > > > > >> the
> > > > > > > > > >> > > > source
> > > > > > > > > >> > > > > > of
> > > > > > > > > >> > > > > > > > > truth w.r.t. remote log size to be
> during
> > the
> > > > > > > lifetime
> > > > > > > > > of
> > > > > > > > > >> a
> > > > > > > > > >> > > > leader.
> > > > > > > > > >> > > > > > > > > The responsibility of maintaining a
> > consistent
> > > > > > > > > >> representation
> > > > > > > > > >> > > of
> > > > > > > > > >> > > > > the
> > > > > > > > > >> > > > > > > > > remote log is shared by Kafka and the
> > plugin.
> > > > > Which
> > > > > > > > > >> system is
> > > > > > > > > >> > > > best
> > > > > > > > > >> > > > > > > > > placed to maintain such a state while
> > > > providing
> > > > > the
> > > > > > > > > >> highest
> > > > > > > > > >> > > > > > > > > consistency guarantees is something both
> > Kafka
> > > > > and
> > > > > > > > > plugin
> > > > > > > > > >> > > > designers
> > > > > > > > > >> > > > > > > > > could help understand better.
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > Many thanks,
> > > > > > > > > >> > > > > > > > > Alexandre
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > On Thu, Nov 17, 2022 at 19:27, Jun Rao
> > > > > > > > > > >> > <jun@confluent.io.invalid
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > > > > > wrote:
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > Hi, Divij,
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > Thanks for the reply.
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > Point #1. Is the average remote
> segment
> > > > > metadata
> > > > > > > > > really
> > > > > > > > > >> > 1KB?
> > > > > > > > > >> > > > > What's
> > > > > > > > > >> > > > > > > > > listed
> > > > > > > > > >> > > > > > > > > > in the public interface is probably
> well
> > > > > below 100
> > > > > > > > > >> bytes.
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > Point #2. I guess you are assuming
> that
> > each
> > > > > > > broker
> > > > > > > > > only
> > > > > > > > > >> > > caches
> > > > > > > > > >> > > > > the
> > > > > > > > > >> > > > > > > > > remote
> > > > > > > > > >> > > > > > > > > > segment metadata in memory. An
> > alternative
> > > > > > > approach
> > > > > > > > is
> > > > > > > > > >> to
> > > > > > > > > >> > > cache
> > > > > > > > > >> > > > > > them
> > > > > > > > > >> > > > > > > in
> > > > > > > > > >> > > > > > > > > > both memory and local disk. That way,
> on
> > > > > broker
> > > > > > > > > restart,
> > > > > > > > > >> > you
> > > > > > > > > >> > > > just
> > > > > > > > > >> > > > > > > need
> > > > > > > > > >> > > > > > > > to
> > > > > > > > > >> > > > > > > > > > fetch the new remote segments'
> metadata
> > > > using
> > > > > the
> > > > > > > > > >> > > > > > > > > > listRemoteLogSegments(TopicIdPartition
> > > > > > > > > topicIdPartition,
> > > > > > > > > >> > int
> > > > > > > > > >> > > > > > > > leaderEpoch)
> > > > > > > > > >> > > > > > > > > > api. Will that work?
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > Point #3. Thanks for the explanation
> > and it
> > > > > sounds
> > > > > > > > > good.
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > Thanks,
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > Jun
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > On Thu, Nov 17, 2022 at 7:31 AM Divij
> > > > Vaidya <
> > > > > > > > > >> > > > > > > divijvaidya13@gmail.com>
> > > > > > > > > >> > > > > > > > > > wrote:
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > Hi Jun
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > There are three points that I would
> > like
> > > > to
> > > > > > > > present
> > > > > > > > > >> here:
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > 1. We would require a large cache
> > size to
> > > > > > > > > efficiently
> > > > > > > > > >> > cache
> > > > > > > > > >> > > > all
> > > > > > > > > >> > > > > > > > segment
> > > > > > > > > >> > > > > > > > > > > metadata.
> > > > > > > > > >> > > > > > > > > > > 2. Linear scan of all metadata at
> > broker
> > > > > startup
> > > > > > > > to
> > > > > > > > > >> > > populate
> > > > > > > > > >> > > > > the
> > > > > > > > > >> > > > > > > > cache
> > > > > > > > > >> > > > > > > > > will
> > > > > > > > > >> > > > > > > > > > > be slow and will impact the archival
> > > > > process.
> > > > > > > > > >> > > > > > > > > > > 3. There is no other use case where
> a
> > full
> > > > > scan
> > > > > > > of
> > > > > > > > > >> > segment
> > > > > > > > > >> > > > > > metadata
> > > > > > > > > >> > > > > > > > is
> > > > > > > > > >> > > > > > > > > > > required.
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > Let's start by quantifying 1. Here's
> > my
> > > > > estimate
> > > > > > > > for
> > > > > > > > > >> the
> > > > > > > > > >> > > size
> > > > > > > > > >> > > > > of
> > > > > > > > > >> > > > > > > the
> > > > > > > > > >> > > > > > > > > cache.
> > > > > > > > > >> > > > > > > > > > > Average size of segment metadata =
> > 1KB.
> > > > This
> > > > > > > could
> > > > > > > > > be
> > > > > > > > > >> > more
> > > > > > > > > >> > > if
> > > > > > > > > >> > > > > we
> > > > > > > > > >> > > > > > > have
> > > > > > > > > >> > > > > > > > > > > frequent leader failover with a
> large
> > > > > number of
> > > > > > > > > leader
> > > > > > > > > >> > > epochs
> > > > > > > > > >> > > > > > being
> > > > > > > > > >> > > > > > > > > stored
> > > > > > > > > >> > > > > > > > > > > per segment.
> > > > > > > > > >> > > > > > > > > > > Segment size = 100MB. Users will
> > prefer to
> > > > > > > reduce
> > > > > > > > > the
> > > > > > > > > >> > > segment
> > > > > > > > > >> > > > > > size
> > > > > > > > > >> > > > > > > > > from the
> > > > > > > > > >> > > > > > > > > > > default value of 1GB to ensure
> timely
> > > > > archival
> > > > > > > of
> > > > > > > > > data
> > > > > > > > > >> > > since
> > > > > > > > > >> > > > > data
> > > > > > > > > >> > > > > > > > from
> > > > > > > > > >> > > > > > > > > > > active segment is not archived.
> > > > > > > > > >> > > > > > > > > > > Cache size = num segments * avg.
> > segment
> > > > > > > metadata
> > > > > > > > > >> size =
> > > > > > > > > >> > > > > > > > > (100TB/100MB)*1KB
> > > > > > > > > >> > > > > > > > > > > = 1GB.
> > > > > > > > > >> > > > > > > > > > > While 1GB for cache may not sound
> > like a
> > > > > large
> > > > > > > > > number
> > > > > > > > > >> for
> > > > > > > > > >> > > > > larger
> > > > > > > > > >> > > > > > > > > machines,
> > > > > > > > > >> > > > > > > > > > > it does eat into the memory as an
> > > > additional
> > > > > > > cache
> > > > > > > > > and
> > > > > > > > > >> > > makes
> > > > > > > > > >> > > > > use
> > > > > > > > > >> > > > > > > > cases
> > > > > > > > > >> > > > > > > > > with
> > > > > > > > > >> > > > > > > > > > > large data retention with low
> > throughput
> > > > > > > expensive
> > > > > > > > > >> (where
> > > > > > > > > >> > > > such
> > > > > > > > > >> > > > > > use
> > > > > > > > > >> > > > > > > > case
> > > > > > > > > > >> > > > > > > > > > > could use smaller machines).
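
The cache-size estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the function name estimate_cache_bytes is hypothetical, and decimal units (1 KB = 1000 bytes) are assumed so that the numbers come out round.

```python
# Decimal units are an assumption made here for round numbers.
KB = 1000
MB = 1000 * KB
GB = 1000 * MB
TB = 1000 * GB


def estimate_cache_bytes(remote_data_bytes, segment_bytes, metadata_bytes_per_segment):
    """Return (segment count, bytes needed to cache every segment's metadata)."""
    num_segments = remote_data_bytes // segment_bytes
    return num_segments, num_segments * metadata_bytes_per_segment


# Figures from the estimate: 100TB in remote storage, 100MB segments,
# 1KB of metadata per segment.
segments, cache_bytes = estimate_cache_bytes(100 * TB, 100 * MB, 1 * KB)
print(segments, cache_bytes / GB)
```

With these inputs the estimate corresponds to one million segments. The result scales linearly with the per-segment metadata size, so a smaller metadata footprint shrinks the cache proportionally.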
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > About point#2:
> > > > > > > > > >> > > > > > > > > > > Even if we say that all segment
> > metadata
> > > > > can fit
> > > > > > > > > into
> > > > > > > > > >> the
> > > > > > > > > >> > > > > cache,
> > > > > > > > > >> > > > > > we
> > > > > > > > > >> > > > > > > > > will
> > > > > > > > > >> > > > > > > > > > > need to populate the cache on broker
> > > > > startup. It
> > > > > > > > > would
> > > > > > > > > >> > not
> > > > > > > > > >> > > be
> > > > > > > > > >> > > > > in
> > > > > > > > > >> > > > > > > the
> > > > > > > > > > >> > > > > > > > > > > critical path of broker startup and
> > hence
> > > > > won't
> > > > > > > > > >> impact
> > > > > > > > > >> > the
> > > > > > > > > >> > > > > > startup
> > > > > > > > > >> > > > > > > > > time.
> > > > > > > > > >> > > > > > > > > > > But it will impact the time when we
> > could
> > > > > start
> > > > > > > > the
> > > > > > > > > >> > > archival
> > > > > > > > > >> > > > > > > process
> > > > > > > > > >> > > > > > > > > since
> > > > > > > > > >> > > > > > > > > > > the RLM thread pool will be blocked
> > on the
> > > > > first
> > > > > > > > > call
> > > > > > > > > >> to
> > > > > > > > > >> > > > > > > > > > > listRemoteLogSegments(). To scan
> > metadata
> > > > > for
> > > > > > > 1MM
> > > > > > > > > >> > segments
> > > > > > > > > >> > > > > > > (computed
> > > > > > > > > >> > > > > > > > > above)
> > > > > > > > > >> > > > > > > > > > > and transfer 1GB data over the
> network
> > > > from
> > > > > a
> > > > > > > RLMM
> > > > > > > > > >> such
> > > > > > > > > >> > as
> > > > > > > > > >> > > a
> > > > > > > > > >> > > > > > remote
> > > > > > > > > >> > > > > > > > > > > database would be in the order of
> > minutes
> > > > > > > > (depending
> > > > > > > > > >> on
> > > > > > > > > >> > how
> > > > > > > > > >> > > > > > > efficient
> > > > > > > > > >> > > > > > > > > the
> > > > > > > > > >> > > > > > > > > > > scan is with the RLMM
> implementation).
> > > > > > > Although, I
> > > > > > > > > >> would
> > > > > > > > > >> > > > > concede
> > > > > > > > > >> > > > > > > that
> > > > > > > > > >> > > > > > > > > > > having RLM threads blocked for a few
> > > > > minutes is
> > > > > > > > > >> perhaps
> > > > > > > > > >> > OK
> > > > > > > > > >> > > > but
> > > > > > > > > >> > > > > if
> > > > > > > > > >> > > > > > > we
> > > > > > > > > >> > > > > > > > > > > introduce the new API proposed in
> the
> > KIP,
> > > > > we
> > > > > > > > would
> > > > > > > > > >> have
> > > > > > > > > >> > a
> > > > > > > > > >> > > > > > > > > > > deterministic startup time for RLM.
> > Adding
> > > > > the
> > > > > > > API
> > > > > > > > > >> comes
> > > > > > > > > >> > > at a
> > > > > > > > > >> > > > > low
> > > > > > > > > >> > > > > > > > cost
> > > > > > > > > >> > > > > > > > > and
> > > > > > > > > >> > > > > > > > > > > I believe the trade off is worth it.
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > About point#3:
> > > > > > > > > >> > > > > > > > > > > We can use
> > > > > > > listRemoteLogSegments(TopicIdPartition
> > > > > > > > > >> > > > > > topicIdPartition,
> > > > > > > > > >> > > > > > > > int
> > > > > > > > > >> > > > > > > > > > > leaderEpoch) to calculate the
> segments
> > > > > eligible
> > > > > > > > for
> > > > > > > > > >> > > deletion
> > > > > > > > > >> > > > > > (based
> > > > > > > > > >> > > > > > > > on
> > > > > > > > > >> > > > > > > > > size
> > > > > > > > > >> > > > > > > > > > > retention) where leader epoch(s)
> > belong to
> > > > > the
> > > > > > > > > current
> > > > > > > > > >> > > leader
> > > > > > > > > >> > > > > > epoch
> > > > > > > > > >> > > > > > > > > chain.
> > > > > > > > > >> > > > > > > > > > > I understand that it may lead to
> > segments
> > > > > > > > belonging
> > > > > > > > > to
> > > > > > > > > >> > > other
> > > > > > > > > >> > > > > > epoch
> > > > > > > > > >> > > > > > > > > lineage
> > > > > > > > > >> > > > > > > > > > > not getting deleted and would
> require
> > a
> > > > > separate
> > > > > > > > > >> > mechanism
> > > > > > > > > >> > > to
> > > > > > > > > >> > > > > > > delete
> > > > > > > > > >> > > > > > > > > them.
> > > > > > > > > >> > > > > > > > > > > The separate mechanism would anyways
> > be
> > > > > required
> > > > > > > > to
> > > > > > > > > >> > delete
> > > > > > > > > >> > > > > these
> > > > > > > > > >> > > > > > > > > "leaked"
> > > > > > > > > >> > > > > > > > > > > segments as there are other cases
> > which
> > > > > could
> > > > > > > lead
> > > > > > > > > to
> > > > > > > > > >> > leaks
> > > > > > > > > >> > > > > such
> > > > > > > > > >> > > > > > as
> > > > > > > > > >> > > > > > > > > network
> > > > > > > > > > >> > > > > > > > > > > problems with RSM midway through
> > writing a
> > > > > > > > segment,
> > > > > > > > > > >> etc.
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > Thank you for the replies so far.
> They
> > > > have
> > > > > made
> > > > > > > > me
> > > > > > > > > >> > > re-think
> > > > > > > > > >> > > > my
> > > > > > > > > >> > > > > > > > > assumptions
> > > > > > > > > >> > > > > > > > > > > and this dialogue has been very
> > > > > constructive for
> > > > > > > > me.
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > Regards,
> > > > > > > > > >> > > > > > > > > > > Divij Vaidya
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > On Thu, Nov 10, 2022 at 10:49 PM Jun
> > Rao
> > > > > > > > > >> > > > > > <jun@confluent.io.invalid
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > > wrote:
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > Hi, Divij,
> > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > Thanks for the reply.
> > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > It's true that the data in Kafka
> > could
> > > > be
> > > > > kept
> > > > > > > > > >> longer
> > > > > > > > > >> > > with
> > > > > > > > > >> > > > > > > KIP-405.
> > > > > > > > > >> > > > > > > > > How
> > > > > > > > > > >> > > > > > > > > > > > much data do you envision having
> > per
> > > > > broker?
> > > > > > > > For
> > > > > > > > > >> 100TB
> > > > > > > > > >> > > > data
> > > > > > > > > >> > > > > > per
> > > > > > > > > >> > > > > > > > > broker,
> > > > > > > > > >> > > > > > > > > > > > with 1GB segment and segment
> > metadata of
> > > > > 100
> > > > > > > > > bytes,
> > > > > > > > > >> it
> > > > > > > > > >> > > > > requires
> > > > > > > > > >> > > > > > > > > > > > 100TB/1GB*100 = 10MB, which should
> > fit
> > > > in
> > > > > > > > memory.
> > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > RemoteLogMetadataManager has two
> > > > > > > > > >> > listRemoteLogSegments()
> > > > > > > > > >> > > > > > methods.
> > > > > > > > > >> > > > > > > > > The one
> > > > > > > > > >> > > > > > > > > > > > you listed
> > > > > > > > listRemoteLogSegments(TopicIdPartition
> > > > > > > > > >> > > > > > > topicIdPartition,
> > > > > > > > > >> > > > > > > > > int
> > > > > > > > > >> > > > > > > > > > > > leaderEpoch) does return data in
> > offset
> > > > > order.
> > > > > > > > > >> However,
> > > > > > > > > >> > > the
> > > > > > > > > >> > > > > > other
> > > > > > > > > >> > > > > > > > > > > > one
> > > > listRemoteLogSegments(TopicIdPartition
> > > > > > > > > >> > > > topicIdPartition)
> > > > > > > > > >> > > > > > > > doesn't
> > > > > > > > > >> > > > > > > > > > > > specify the return order. I assume
> > that
> > > > > you
> > > > > > > need
> > > > > > > > > the
> > > > > > > > > >> > > latter
> > > > > > > > > >> > > > > to
> > > > > > > > > >> > > > > > > > > calculate
> > > > > > > > > >> > > > > > > > > > > > the segment size?
> > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > Thanks,
> > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > Jun
> > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > On Thu, Nov 10, 2022 at 10:25 AM
> > Divij
> > > > > Vaidya
> > > > > > > <
> > > > > > > > > >> > > > > > > > > divijvaidya13@gmail.com>
> > > > > > > > > >> > > > > > > > > > > > wrote:
> > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > *Jun,*
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > *"the default implementation of
> > RLMM
> > > > > does
> > > > > > > > local
> > > > > > > > > >> > > caching,
> > > > > > > > > >> > > > > > > right?"*
> > > > > > > > > >> > > > > > > > > > > > > Yes, Jun. The default
> > implementation
> > > > of
> > > > > RLMM
> > > > > > > > > does
> > > > > > > > > >> > > indeed
> > > > > > > > > >> > > > > > cache
> > > > > > > > > >> > > > > > > > the
> > > > > > > > > >> > > > > > > > > > > > segment
> > > > > > > > > >> > > > > > > > > > > > > metadata today, hence, it won't
> > work
> > > > > for use
> > > > > > > > > cases
> > > > > > > > > >> > when
> > > > > > > > > >> > > > the
> > > > > > > > > >> > > > > > > > number
> > > > > > > > > >> > > > > > > > > of
> > > > > > > > > >> > > > > > > > > > > > > segments in remote storage is
> > large
> > > > > enough
> > > > > > > to
> > > > > > > > > >> exceed
> > > > > > > > > >> > > the
> > > > > > > > > >> > > > > size
> > > > > > > > > >> > > > > > > of
> > > > > > > > > >> > > > > > > > > cache.
> > > > > > > > > >> > > > > > > > > > > > As
> > > > > > > > > >> > > > > > > > > > > > > part of this KIP, I will
> > implement the
> > > > > new
> > > > > > > > > >> proposed
> > > > > > > > > >> > API
> > > > > > > > > >> > > > in
> > > > > > > > > >> > > > > > the
> > > > > > > > > >> > > > > > > > > default
> > > > > > > > > >> > > > > > > > > > > > > implementation of RLMM but the
> > > > > underlying
> > > > > > > > > >> > > implementation
> > > > > > > > > >> > > > > will
> > > > > > > > > >> > > > > > > > > still be
> > > > > > > > > >> > > > > > > > > > > a
> > > > > > > > > >> > > > > > > > > > > > > scan. I will pick up optimizing
> > that
> > > > in
> > > > > a
> > > > > > > > > separate
> > > > > > > > > >> > PR.
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > *"we also cache all segment
> > metadata
> > > > in
> > > > > the
> > > > > > > > > >> brokers
> > > > > > > > > >> > > > without
> > > > > > > > > >> > > > > > > > > KIP-405. Do
> > > > > > > > > >> > > > > > > > > > > > you
> > > > > > > > > >> > > > > > > > > > > > > see a need to change that?"*
> > > > > > > > > >> > > > > > > > > > > > > Please correct me if I am wrong
> > here
> > > > > but we
> > > > > > > > > cache
> > > > > > > > > >> > > > metadata
> > > > > > > > > >> > > > > > for
> > > > > > > > > >> > > > > > > > > segments
> > > > > > > > > >> > > > > > > > > > > > > "residing in local storage". The
> > size
> > > > > of the
> > > > > > > > > >> current
> > > > > > > > > >> > > > cache
> > > > > > > > > >> > > > > > > works
> > > > > > > > > >> > > > > > > > > fine
> > > > > > > > > >> > > > > > > > > > > for
> > > > > > > > > >> > > > > > > > > > > > > the scale of the number of
> > segments
> > > > > that we
> > > > > > > > > >> expect to
> > > > > > > > > >> > > > store
> > > > > > > > > >> > > > > > in
> > > > > > > > > >> > > > > > > > > local
> > > > > > > > > >> > > > > > > > > > > > > storage. After KIP-405, that
> cache
> > > > will
> > > > > > > > continue
> > > > > > > > > >> to
> > > > > > > > > >> > > store
> > > > > > > > > >> > > > > > > > metadata
> > > > > > > > > >> > > > > > > > > for
> > > > > > > > > >> > > > > > > > > > > > > segments which are residing in
> > local
> > > > > storage
> > > > > > > > and
> > > > > > > > > >> > hence,
> > > > > > > > > >> > > > we
> > > > > > > > > >> > > > > > > don't
> > > > > > > > > >> > > > > > > > > need
> > > > > > > > > >> > > > > > > > > > > to
> > > > > > > > > >> > > > > > > > > > > > > change that. For segments which
> > have
> > > > > been
> > > > > > > > > >> offloaded
> > > > > > > > > >> > to
> > > > > > > > > >> > > > > remote
> > > > > > > > > >> > > > > > > > > storage,
> > > > > > > > > >> > > > > > > > > > > it
> > > > > > > > > >> > > > > > > > > > > > > would rely on RLMM. Note that
> the
> > > > scale
> > > > > of
> > > > > > > > data
> > > > > > > > > >> > stored
> > > > > > > > > >> > > in
> > > > > > > > > >> > > > > > RLMM
> > > > > > > > > >> > > > > > > is
> > > > > > > > > >> > > > > > > > > > > > different
> > > > > > > > > >> > > > > > > > > > > > > from local cache because the
> > number of
> > > > > > > > segments
> > > > > > > > > is
> > > > > > > > > >> > > > expected
> > > > > > > > > >> > > > > > to
> > > > > > > > > >> > > > > > > be
> > > > > > > > > >> > > > > > > > > much
> > > > > > > > > >> > > > > > > > > > > > > larger than what current
> > > > implementation
> > > > > > > stores
> > > > > > > > > in
> > > > > > > > > >> > local
> > > > > > > > > >> > > > > > > storage.
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > 2,3,4:
> > > > > > > > > >> > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > > >> > > > > does
> > > > > > > > > >> > > > > > > > > specify
> > > > > > > > > >> > > > > > > > > > > the
> > > > > > > > > >> > > > > > > > > > > > > order i.e. it returns the
> segments
> > > > > sorted by
> > > > > > > > > first
> > > > > > > > > >> > > offset
> > > > > > > > > >> > > > > in
> > > > > > > > > >> > > > > > > > > ascending
> > > > > > > > > >> > > > > > > > > > > > > order. I am copying the API docs
> > for
> > > > > KIP-405
> > > > > > > > > here
> > > > > > > > > >> for
> > > > > > > > > >> > > > your
> > > > > > > > > >> > > > > > > > > reference
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > *Returns iterator of remote log
> > > > segment
> > > > > > > > > metadata,
> > > > > > > > > >> > > sorted
> > > > > > > > > >> > > > by
> > > > > > > > > >> > > > > > > > {@link
> > > > > > > > > >> > > > > > > > > > > > >
> > > > RemoteLogSegmentMetadata#startOffset()}
> > > > > > > > > > >> in ascending
> > > > > > > > > >> > > order
> > > > > > > > > >> > > > > > which
> > > > > > > > > >> > > > > > > > > > > contains
> > > > > > > > > >> > > > > > > > > > > > > the given leader epoch. This is
> > used
> > > > by
> > > > > > > remote
> > > > > > > > > log
> > > > > > > > > >> > > > > retention
> > > > > > > > > >> > > > > > > > > management
> > > > > > > > > > >> > > > > > > > > > > > > subsystem to fetch the segment
> > metadata
> > > > > for a
> > > > > > > > > given
> > > > > > > > > >> > > leader
> > > > > > > > > > >> > > > > > > > > epoch. @param
> > > > > > > > > >> > > > > > > > > > > > > topicIdPartition topic
> > partition @param
> > > > > > > > > >> leaderEpoch
> > > > > > > > > >> > > > > > leader
> > > > > > > > > > >> > > > > > > > > > > > > epoch @return
> > > > > > > > > >> > > > > > > > > > > > > Iterator of remote segments,
> > sorted by
> > > > > start
> > > > > > > > > >> offset
> > > > > > > > > >> > in
> > > > > > > > > >> > > > > > > ascending
> > > > > > > > > >> > > > > > > > > > > order. *
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > *Luke,*
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > 5. Note that we are trying to
> > optimize
> > > > > the
> > > > > > > > > >> efficiency
> > > > > > > > > >> > > of
> > > > > > > > > >> > > > > size
> > > > > > > > > >> > > > > > > > based
> > > > > > > > > >> > > > > > > > > > > > > retention for remote storage.
> > KIP-405
> > > > > does
> > > > > > > not
> > > > > > > > > >> > > introduce
> > > > > > > > > >> > > > a
> > > > > > > > > >> > > > > > new
> > > > > > > > > >> > > > > > > > > config
> > > > > > > > > >> > > > > > > > > > > for
> > > > > > > > > >> > > > > > > > > > > > > periodically checking remote
> > similar
> > > > to
> > > > > > > > > >> > > > > > > > > > > log.retention.check.interval.ms
> > > > > > > > > >> > > > > > > > > > > > > which is applicable for remote
> > > > storage.
> > > > > > > Hence,
> > > > > > > > > the
> > > > > > > > > >> > > metric
> > > > > > > > > >> > > > > > will
> > > > > > > > > >> > > > > > > be
> > > > > > > > > >> > > > > > > > > > > updated
> > > > > > > > > >> > > > > > > > > > > > > at the time of invoking log
> > retention
> > > > > check
> > > > > > > > for
> > > > > > > > > >> > remote
> > > > > > > > > >> > > > tier
> > > > > > > > > >> > > > > > > which
> > > > > > > > > >> > > > > > > > > is
> > > > > > > > > >> > > > > > > > > > > > > pending implementation today. We
> > can
> > > > > perhaps
> > > > > > > > > come
> > > > > > > > > >> > back
> > > > > > > > > >> > > > and
> > > > > > > > > >> > > > > > > update
> > > > > > > > > >> > > > > > > > > the
> > > > > > > > > >> > > > > > > > > > > > > metric description after the
> > > > > implementation
> > > > > > > of
> > > > > > > > > log
> > > > > > > > > >> > > > > retention
> > > > > > > > > >> > > > > > > > check
> > > > > > > > > >> > > > > > > > > in
> > > > > > > > > >> > > > > > > > > > > > > RemoteLogManager.
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > --
> > > > > > > > > >> > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > On Thu, Nov 10, 2022 at 6:16 AM
> > Luke
> > > > > Chen <
> > > > > > > > > >> > > > > showuon@gmail.com
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > > > wrote:
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > Hi Divij,
> > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > One more question about the
> > metric:
> > > > > > > > > >> > > > > > > > > > > > > > I think the metric will be
> > updated
> > > > > when
> > > > > > > > > >> > > > > > > > > > > > > > (1) each time we run the log
> > > > retention
> > > > > > > check
> > > > > > > > > >> (that
> > > > > > > > > >> > > is,
> > > > > > > > > >> > > > > > > > > > > > > >
> log.retention.check.interval.ms
> > )
> > > > > > > > > > >> > > > > > > > > > > > > > (2) when a user explicitly calls
> > > > > > > > getRemoteLogSize
> > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > Is that correct?
> > > > > > > > > >> > > > > > > > > > > > > > Maybe we should add a note in
> > metric
> > > > > > > > > >> description,
> > > > > > > > > >> > > > > > otherwise,
> > > > > > > > > > >> > > > > > > > a
> > > > > > > > > > >> > > > > > > > > > > user
> > > > > > > > > > >> > > > > > > > > > > > > who gets,
> > > > > > > > > > >> > > > > > > > > > > > > > let's say, 0 for
> > RemoteLogSizeBytes
> > > > > > will be
> > > > > > > > > >> > surprised.
> > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > Otherwise, LGTM
> > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > Thank you for the KIP
> > > > > > > > > >> > > > > > > > > > > > > > Luke
> > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > On Thu, Nov 10, 2022 at 2:55
> AM
> > Jun
> > > > > Rao
> > > > > > > > > >> > > > > > > > <jun@confluent.io.invalid
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > wrote:
> > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > Hi, Divij,
> > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > Thanks for the explanation.
> > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > 1. Hmm, the default
> > implementation
> > > > > of
> > > > > > > RLMM
> > > > > > > > > >> does
> > > > > > > > > >> > > local
> > > > > > > > > >> > > > > > > > caching,
> > > > > > > > > >> > > > > > > > > > > right?
> > > > > > > > > >> > > > > > > > > > > > > > > Currently, we also cache all
> > > > segment
> > > > > > > > > metadata
> > > > > > > > > >> in
> > > > > > > > > >> > > the
> > > > > > > > > >> > > > > > > brokers
> > > > > > > > > >> > > > > > > > > > > without
> > > > > > > > > >> > > > > > > > > > > > > > > KIP-405. Do you see a need
> to
> > > > change
> > > > > > > that?
> > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > 2,3,4: Yes, your explanation
> > makes
> > > > > > > sense.
> > > > > > > > > >> > However,
> > > > > > > > > >> > > > > > > > > > > > > > > currently,
> > > > > > > > > >> > > > > >
> RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > > >> > > > > > > > > doesn't
> > > > > > > > > >> > > > > > > > > > > > > > specify
> > > > > > > > > >> > > > > > > > > > > > > > > a particular order of the
> > > > iterator.
> > > > > Do
> > > > > > > you
> > > > > > > > > >> intend
> > > > > > > > > >> > > to
> > > > > > > > > >> > > > > > change
> > > > > > > > > >> > > > > > > > > that?
> > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > Jun
> > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 3:31
> AM
> > > > Divij
> > > > > > > > Vaidya
> > > > > > > > > <
> > > > > > > > > >> > > > > > > > > > > divijvaidya13@gmail.com
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > wrote:
> > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > Hey Jun
> > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > Thank you for your
> comments.
> > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > *1. "RLMM implementor
> could
> > > > ensure
> > > > > > > that
> > > > > > > > > >> > > > > > > > > listRemoteLogSegments()
> > > > > > > > > >> > > > > > > > > > > is
> > > > > > > > > >> > > > > > > > > > > > > > fast"*
> > > > > > > > > >> > > > > > > > > > > > > > > > This would be ideal but
> > > > > pragmatically,
> > > > > > > > it
> > > > > > > > > is
> > > > > > > > > >> > > > > difficult
> > > > > > > > > >> > > > > > to
> > > > > > > > > >> > > > > > > > > ensure
> > > > > > > > > >> > > > > > > > > > > > that
> > > > > > > > > >> > > > > > > > > > > > > > > > listRemoteLogSegments() is
> > fast.
> > > > > This
> > > > > > > is
> > > > > > > > > >> > because
> > > > > > > > > >> > > of
> > > > > > > > > >> > > > > the
> > > > > > > > > >> > > > > > > > > > > possibility
> > > > > > > > > >> > > > > > > > > > > > > of
> > > > > > > > > >> > > > > > > > > > > > > > a
> > > > > > > > > >> > > > > > > > > > > > > > > > large number of segments
> > (much
> > > > > larger
> > > > > > > > than
> > > > > > > > > >> what
> > > > > > > > > >> > > > Kafka
> > > > > > > > > >> > > > > > > > > currently
> > > > > > > > > >> > > > > > > > > > > > > handles
> > > > > > > > > >> > > > > > > > > > > > > > > > with local storage today)
> > which would
> > > > > make
> > > > > > > it
> > > > > > > > > >> > > infeasible
> > > > > > > > > >> > > > to
> > > > > > > > > >> > > > > > > adopt
> > > > > > > > > >> > > > > > > > > > > > > strategies
> > > > > > > > > >> > > > > > > > > > > > > > > such
> > > > > > > > > >> > > > > > > > > > > > > > > > as local caching to
> improve
> > the
> > > > > > > > > performance
> > > > > > > > > >> of
> > > > > > > > > >> > > > > > > > > > > > listRemoteLogSegments.
> > > > > > > > > >> > > > > > > > > > > > > > > Apart
> > > > > > > > > >> > > > > > > > > > > > > > > > from caching (which won't
> > work
> > > > > due to
> > > > > > > > size
> > > > > > > > > >> > > > > > limitations) I
> > > > > > > > > >> > > > > > > > > can't
> > > > > > > > > >> > > > > > > > > > > > think
> > > > > > > > > >> > > > > > > > > > > > > > of
> > > > > > > > > >> > > > > > > > > > > > > > > > other strategies which may
> > > > > eliminate
> > > > > > > the
> > > > > > > > > >> need
> > > > > > > > > >> > for
> > > > > > > > > >> > > > IO
> > > > > > > > > >> > > > > > > > > > > > > > > > operations proportional to
> > the
> > > > > number
> > > > > > > of
> > > > > > > > > >> total
> > > > > > > > > >> > > > > > segments.
> > > > > > > > > >> > > > > > > > > Please
> > > > > > > > > >> > > > > > > > > > > > > advise
> > > > > > > > > >> > > > > > > > > > > > > > if
> > > > > > > > > >> > > > > > > > > > > > > > > > you have something in
> mind.
> > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > 2.  "*If the size exceeds
> > the
> > > > > > > retention
> > > > > > > > > >> size,
> > > > > > > > > >> > we
> > > > > > > > > >> > > > need
> > > > > > > > > >> > > > > > to
> > > > > > > > > >> > > > > > > > > > > determine
> > > > > > > > > >> > > > > > > > > > > > > the
> > > > > > > > > >> > > > > > > > > > > > > > > > subset of segments to
> > delete to
> > > > > bring
> > > > > > > > the
> > > > > > > > > >> size
> > > > > > > > > >> > > > within
> > > > > > > > > >> > > > > > the
> > > > > > > > > >> > > > > > > > > > > retention
> > > > > > > > > >> > > > > > > > > > > > > > > limit.
> > > > > > > > > >> > > > > > > > > > > > > > > > Do we need to call
> > > > > > > > > >> > > > > > > > > > >
> > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > > >> > > > > > > > > > > > > to
> > > > > > > > > >> > > > > > > > > > > > > > > > determine that?"*
> > > > > > > > > >> > > > > > > > > > > > > > > > Yes, we need to call
> > > > > > > > > >> listRemoteLogSegments() to
> > > > > > > > > >> > > > > > determine
> > > > > > > > > >> > > > > > > > > which
> > > > > > > > > >> > > > > > > > > > > > > > segments
> > > > > > > > > >> > > > > > > > > > > > > > > > should be deleted. But
> > there is
> > > > a
> > > > > > > > > difference
> > > > > > > > > >> > with
> > > > > > > > > >> > > > the
> > > > > > > > > >> > > > > > use
> > > > > > > > > >> > > > > > > > > case we
> > > > > > > > > >> > > > > > > > > > > > are
> > > > > > > > > >> > > > > > > > > > > > > > > > trying to optimize with
> this
> > > > KIP.
> > > > > To
> > > > > > > > > >> determine
> > > > > > > > > >> > > the
> > > > > > > > > >> > > > > > subset
> > > > > > > > > >> > > > > > > > of
> > > > > > > > > >> > > > > > > > > > > > segments
> > > > > > > > > >> > > > > > > > > > > > > > > which
> > > > > > > > > >> > > > > > > > > > > > > > > > would be deleted, we only
> > read
> > > > > > > metadata
> > > > > > > > > for
> > > > > > > > > >> > > > segments
> > > > > > > > > >> > > > > > > which
> > > > > > > > > >> > > > > > > > > would
> > > > > > > > > >> > > > > > > > > > > be
> > > > > > > > > >> > > > > > > > > > > > > > > deleted
> > > > > > > > > >> > > > > > > > > > > > > > > > via the
> > listRemoteLogSegments().
> > > > > But
> > > > > > > to
> > > > > > > > > >> > determine
> > > > > > > > > >> > > > the
> > > > > > > > > >> > > > > > > > > > > totalLogSize,
> > > > > > > > > >> > > > > > > > > > > > > > which
> > > > > > > > > >> > > > > > > > > > > > > > > > is required every time
> > retention
> > > > > logic
> > > > > > > > > >> based on
> > > > > > > > > >> > > > size
> > > > > > > > > >> > > > > > > > > executes, we
> > > > > > > > > >> > > > > > > > > > > > > read
> > > > > > > > > >> > > > > > > > > > > > > > > > metadata of *all* the
> > segments
> > > > in
> > > > > > > remote
> > > > > > > > > >> > storage.
> > > > > > > > > >> > > > > > Hence,
> > > > > > > > > >> > > > > > > > the
> > > > > > > > > >> > > > > > > > > > > number
> > > > > > > > > >> > > > > > > > > > > > > of
> > > > > > > > > >> > > > > > > > > > > > > > > > results returned by
> > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > *RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > > >> > > > > > > > > > > > > > *is
> > > > > > > > > >> > > > > > > > > > > > > > > > different when we are
> > > > calculating
> > > > > > > > > >> totalLogSize
> > > > > > > > > >> > > vs.
> > > > > > > > > >> > > > > when
> > > > > > > > > >> > > > > > > we
> > > > > > > > > >> > > > > > > > > are
> > > > > > > > > >> > > > > > > > > > > > > > > determining
> > > > > > > > > >> > > > > > > > > > > > > > > > the subset of segments to
> > > > delete.
> > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > 3.
> > > > > > > > > >> > > > > > > > > > > > > > > > *"Also, what about
> > time-based
> > > > > > > retention?
> > > > > > > > > To
> > > > > > > > > >> > make
> > > > > > > > > >> > > > that
> > > > > > > > > >> > > > > > > > > efficient,
> > > > > > > > > >> > > > > > > > > > > do
> > > > > > > > > >> > > > > > > > > > > > > we
> > > > > > > > > >> > > > > > > > > > > > > > > need
> > > > > > > > > >> > > > > > > > > > > > > > > > to make some additional
> > > > interface
> > > > > > > > > > >> changes?"* No.
> > > > > > > > > >> > > > Note
> > > > > > > > > >> > > > > > that
> > > > > > > > > >> > > > > > > > > time
> > > > > > > > > >> > > > > > > > > > > > > > complexity
> > > > > > > > > >> > > > > > > > > > > > > > > > to determine the segments
> > for
> > > > > > > retention
> > > > > > > > is
> > > > > > > > > >> > > > different
> > > > > > > > > >> > > > > > for
> > > > > > > > > >> > > > > > > > time
> > > > > > > > > >> > > > > > > > > > > based
> > > > > > > > > >> > > > > > > > > > > > > vs.
> > > > > > > > > >> > > > > > > > > > > > > > > > size based. For time
> based,
> > the
> > > > > time
> > > > > > > > > >> complexity
> > > > > > > > > >> > > is
> > > > > > > > > >> > > > a
> > > > > > > > > >> > > > > > > > > function of
> > > > > > > > > >> > > > > > > > > > > > the
> > > > > > > > > >> > > > > > > > > > > > > > > number
> > > > > > > > > >> > > > > > > > > > > > > > > > of segments which are
> > "eligible
> > > > > for
> > > > > > > > > >> deletion"
> > > > > > > > > >> > > > (since
> > > > > > > > > >> > > > > we
> > > > > > > > > >> > > > > > > > only
> > > > > > > > > >> > > > > > > > > read
> > > > > > > > > >> > > > > > > > > > > > > > > metadata
> > > > > > > > > >> > > > > > > > > > > > > > > > for segments which would
> be
> > > > > deleted)
> > > > > > > > > >> whereas in
> > > > > > > > > >> > > > size
> > > > > > > > > >> > > > > > > based
> > > > > > > > > >> > > > > > > > > > > > retention,
> > > > > > > > > >> > > > > > > > > > > > > > the
> > > > > > > > > >> > > > > > > > > > > > > > > > time complexity is a
> > function of
> > > > > "all
> > > > > > > > > >> segments"
> > > > > > > > > >> > > > > > available
> > > > > > > > > >> > > > > > > > in
> > > > > > > > > >> > > > > > > > > > > remote
> > > > > > > > > >> > > > > > > > > > > > > > > storage
> > > > > > > > > >> > > > > > > > > > > > > > > > (metadata of all segments
> > needs
> > > > > to be
> > > > > > > > read
> > > > > > > > > > >> > > > > > > > > > > > > > > > to calculate the total size). As you may observe, this KIP will bring
> > > > > > > > > > >> > > > > > > > > > > > > > > > the time complexity for both time based retention & size based
> > > > > > > > > > >> > > > > > > > > > > > > > > > retention to the same function.
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > 4. Also, please note that this new API introduced in this KIP also
> > > > > > > > > > >> > > > > > > > > > > > > > > > enables us to provide a metric for total size of data stored in remote
> > > > > > > > > > >> > > > > > > > > > > > > > > > storage. Without the API, calculation of this metric will become very
> > > > > > > > > > >> > > > > > > > > > > > > > > > expensive with *listRemoteLogSegments().*
> > > > > > > > > > >> > > > > > > > > > > > > > > > I understand that your motivation here is to avoid polluting the
> > > > > > > > > > >> > > > > > > > > > > > > > > > interface with optimization specific APIs and I will agree with that
> > > > > > > > > > >> > > > > > > > > > > > > > > > goal. But I believe that this new API proposed in the KIP brings in
> > > > > > > > > > >> > > > > > > > > > > > > > > > significant improvement and there is no other work around available to
> > > > > > > > > > >> > > > > > > > > > > > > > > > achieve the same performance.
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > Regards,
> > > > > > > > > > >> > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 12:12 AM Jun Rao <jun@confluent.io.invalid> wrote:
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > Hi, Divij,
> > > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > Thanks for the KIP. Sorry for the late reply.
> > > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > The motivation of the KIP is to improve the efficiency of size based
> > > > > > > > > > >> > > > > > > > > > > > > > > > > retention. I am not sure the proposed changes are enough. For example,
> > > > > > > > > > >> > > > > > > > > > > > > > > > > if the size exceeds the retention size, we need to determine the
> > > > > > > > > > >> > > > > > > > > > > > > > > > > subset of segments to delete to bring the size within the retention
> > > > > > > > > > >> > > > > > > > > > > > > > > > > limit. Do we need to call
> > > > > > > > > > >> > > > > > > > > > > > > > > > > RemoteLogMetadataManager.listRemoteLogSegments() to determine that?
> > > > > > > > > > >> > > > > > > > > > > > > > > > > Also, what about time-based retention? To make that efficient, do we
> > > > > > > > > > >> > > > > > > > > > > > > > > > > need to make some additional interface changes?
> > > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > An alternative approach is for the RLMM implementor to make sure that
> > > > > > > > > > >> > > > > > > > > > > > > > > > > RemoteLogMetadataManager.listRemoteLogSegments() is fast (e.g., with
> > > > > > > > > > >> > > > > > > > > > > > > > > > > local caching). This way, we could keep the interface simple. Have we
> > > > > > > > > > >> > > > > > > > > > > > > > > > > considered that?
> > > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > Jun
> > > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > On Wed, Sep 28, 2022 at 6:28 AM Divij Vaidya <divijvaidya13@gmail.com> wrote:
> > > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > Hey folks
> > > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > Does anyone else have any thoughts on this before I propose this for a
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > vote?
> > > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > --
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > On Mon, Sep 5, 2022 at 12:57 PM Satish Duggana <satish.duggana@gmail.com> wrote:
> > > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > Thanks for the KIP Divij!
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > This is a nice improvement to avoid recalculation of size. Customized
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > RLMMs can implement the best possible approach by caching or
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > maintaining the size in an efficient way. But this is not a big
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > concern for the default topic based RLMM as mentioned in the KIP.
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > ~Satish.
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > On Wed, 13 Jul 2022 at 18:48, Divij Vaidya <divijvaidya13@gmail.com> wrote:
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > Thank you for your review Luke.
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Reg: would the new `RemoteLogSizeBytes` metric be a performance
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > overhead? Although we move the calculation to a separate API, we
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > still can't assume users will implement a light-weight method, right?
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > This metric would be logged using the information that is already
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > being calculated for handling remote retention logic, hence, no
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > additional work is required to calculate this metric. More
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > specifically, whenever RemoteLogManager calls getRemoteLogSize API,
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > this metric would be captured. This API call is made every time
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > RemoteLogManager wants to handle expired remote log segments (which
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > should be periodic). Does that address your concern?
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > On Tue, Jul 12, 2022 at 11:01 AM Luke Chen <showuon@gmail.com> wrote:
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Hi Divij,
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Thanks for the KIP!
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > I think it makes sense to delegate the responsibility of calculation
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > to the specific RemoteLogMetadataManager implementation.
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > But one thing I'm not quite sure about: would the new
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > `RemoteLogSizeBytes` metric be a performance overhead?
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Although we move the calculation to a separate API, we still can't
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > assume users will implement a light-weight method, right?
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Thank you.
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Luke
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > On Fri, Jul 1, 2022 at 5:47 PM Divij Vaidya <divijvaidya13@gmail.com> wrote:
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-852%3A+Optimize+calculation+of+size+for+log+in+remote+tier
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Hey folks
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Please take a look at this KIP which proposes an extension to KIP-405.
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > This is my first KIP with Apache Kafka community so any feedback would
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > be highly appreciated.
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Cheers!
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Sr. Software Engineer
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Amazon
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> >
>

Re: [DISCUSS] KIP-852 Optimize calculation of size for log in remote tier

Posted by Divij Vaidya <di...@gmail.com>.
Thank you for spotting that Luke. I have fixed the snippet now.

*Satish*, I am waiting for your review on this one since you provided some
comments earlier in this discussion. Please check the KIP once again when
you get a chance and vote at
https://lists.apache.org/thread/soz00990gvzodv7oyqj4ysvktrqy6xfk if you
don't have any further concerns. Thank you!!

--
Divij Vaidya
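
For readers joining the thread at this point, the trade-off under discussion can be sketched roughly as follows. This is only an illustration under stated assumptions: SegmentMeta, SimpleMetadataManager, ScanningMetadataManager and IndexedMetadataManager are hypothetical, simplified stand-ins and not the actual Kafka interfaces. The sketch also ignores partitions, leader epochs and thread safety. It shows that a caller limited to a list-all-segments call has to do a full scan on every size check, while a dedicated size method lets an implementation answer from a running total.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for remote segment metadata (not the Kafka class). */
final class SegmentMeta {
    final long baseOffset;
    final long sizeInBytes;

    SegmentMeta(long baseOffset, long sizeInBytes) {
        this.baseOffset = baseOffset;
        this.sizeInBytes = sizeInBytes;
    }
}

/** Hypothetical, trimmed-down metadata manager used only for this illustration. */
interface SimpleMetadataManager {
    void addSegment(SegmentMeta segment);

    List<SegmentMeta> listSegments();

    /** Fallback: aggregate by listing everything, i.e. a full scan per call. */
    default long remoteLogSize() {
        long total = 0L;
        for (SegmentMeta segment : listSegments()) {
            total += segment.sizeInBytes;
        }
        return total;
    }
}

/** Keeps only the segment list and relies on the full-scan fallback. */
final class ScanningMetadataManager implements SimpleMetadataManager {
    private final List<SegmentMeta> segments = new ArrayList<>();

    @Override
    public void addSegment(SegmentMeta segment) {
        segments.add(segment);
    }

    @Override
    public List<SegmentMeta> listSegments() {
        return new ArrayList<>(segments);
    }
}

/** Maintains a running total so the size lookup does not scan the segments. */
final class IndexedMetadataManager implements SimpleMetadataManager {
    private final List<SegmentMeta> segments = new ArrayList<>();
    private long totalBytes = 0L;

    @Override
    public void addSegment(SegmentMeta segment) {
        segments.add(segment);
        totalBytes += segment.sizeInBytes;
    }

    @Override
    public List<SegmentMeta> listSegments() {
        return new ArrayList<>(segments);
    }

    @Override
    public long remoteLogSize() {
        return totalBytes;
    }
}
```

Both implementations report the same size; the difference is only in how much metadata has to be read to produce it, which is the cost the thread is concerned with when the metadata store is not local to the broker.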



On Thu, Jul 13, 2023 at 2:39 PM Jorge Esteban Quilcate Otoya <
quilcate.jorge@gmail.com> wrote:

> Thanks Divij.
>
> I was confusing this with the metric tags used by clients, which are based
> on topic and partition. Ideally the partition label could be at a DEBUG
> recording level, but that's outside the scope of this KIP.
>
> Looks good to me, thanks again!
>
> Jorge.
>
> On Wed, 12 Jul 2023 at 15:55, Divij Vaidya <di...@gmail.com>
> wrote:
>
> > Jorge,
> > About API name: Good point. I have changed it to remoteLogSize instead of
> > getRemoteLogSize
> >
> > About partition tag in the metric: We don't use partition tag across any
> > of the RemoteStorage metrics and I would like to keep this metric aligned
> > with the rest. I will change the metric though to type=BrokerTopicMetrics
> > instead of type=RemoteLogManager, since this is topic level information
> > and not specific to RemoteLogManager.
> >
> >
> > Satish,
> > Ah yes! Updated from "This would increase the broker start-up time." to
> > "This would increase the bootstrap time for the remote storage thread
> > pool before the first eligible segment is archived."
> >
> > --
> > Divij Vaidya
> >
> >
> >
> > On Mon, Jul 3, 2023 at 2:07 PM Satish Duggana <sa...@gmail.com>
> > wrote:
> >
> > > Thanks Divij for taking the feedback and updating the motivation
> > > section in the KIP.
> > >
> > > One more comment on Alternative solution-3: the con is not valid, as
> > > that will not affect the broker restart times, as discussed in the
> > > earlier email in this thread. You may want to update that.
> > >
> > > ~Satish.
> > >
> > > On Sun, 2 Jul 2023 at 01:03, Divij Vaidya <di...@gmail.com>
> > > wrote:
> > > >
> > > > Thank you folks for reviewing this KIP.
> > > >
> > > > Satish, I have modified the motivation to make it more clear. Now it
> > > > says, "Since the main feature of tiered storage is storing a large
> > > > amount of data, we expect num_remote_segments to be large. A frequent
> > > > linear scan (i.e. listing all segment metadata) could be
> > > > expensive/slower because of the underlying storage used by
> > > > RemoteLogMetadataManager. This slowness to list all segment metadata
> > > > could result in the loss of availability...."
> > > >
> > > > Jun, Kamal, Satish, if you don't have any further concerns, I would
> > > > appreciate a vote for this KIP in the voting thread -
> > > > https://lists.apache.org/thread/soz00990gvzodv7oyqj4ysvktrqy6xfk
> > > >
> > > > --
> > > > Divij Vaidya
> > > >
> > > >
> > > >
> > > > On Sat, Jul 1, 2023 at 6:16 AM Kamal Chandraprakash <
> > > > kamal.chandraprakash@gmail.com> wrote:
> > > >
> > > > > Hi Divij,
> > > > >
> > > > > Thanks for the explanation. LGTM.
> > > > >
> > > > > --
> > > > > Kamal
> > > > >
> > > > > > On Sat, Jul 1, 2023 at 7:28 AM Satish Duggana <
> > > > > > satish.duggana@gmail.com> wrote:
> > > > >
> > > > > > Hi Divij,
> > > > > > > I am fine with having an API to compute the size as I mentioned in
> > > > > > > my earlier reply in this mail thread. But I have the below comment
> > > > > > > for the motivation for this KIP.
> > > > > > >
> > > > > > > As you discussed offline, the main issue here is that listing calls
> > > > > > > for remote log segment metadata are slower because of the storage
> > > > > > > used for RLMM. These can be avoided with this new API.
> > > > > > >
> > > > > > > Please add this in the motivation section as it is one of the main
> > > > > > > motivations for the KIP.
> > > > > >
> > > > > > Thanks,
> > > > > > Satish.
> > > > > >
> > > > > > > On Sat, 1 Jul 2023 at 01:43, Jun Rao <ju...@confluent.io.invalid>
> > > > > > > wrote:
> > > > > > >
> > > > > > > Hi, Divij,
> > > > > > >
> > > > > > > Sorry for the late reply.
> > > > > > >
> > > > > > > > Given your explanation, the new API sounds reasonable to me. Is
> > > > > > > > that enough to build the external metadata layer for the remote
> > > > > > > > segments or do you need some additional API changes?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > > On Fri, Jun 9, 2023 at 7:08 AM Divij Vaidya <
> > > > > > > > divijvaidya13@gmail.com> wrote:
> > > > > > >
> > > > > > > > Thank you for looking into this Kamal.
> > > > > > > >
> > > > > > > > > You are right in saying that a cold start (i.e. leadership
> > > > > > > > > failover or broker startup) does not impact the broker startup
> > > > > > > > > duration. But it does have the following impact:
> > > > > > > > > 1. It leads to a burst of full-scan requests to RLMM in case
> > > > > > > > > multiple leadership failovers occur at the same time. Even if the
> > > > > > > > > RLMM implementation has the capability to serve the total size
> > > > > > > > > from an index (and hence handle this burst), we wouldn't be able
> > > > > > > > > to use it since the current API necessarily calls for a full scan.
> > > > > > > > > 2. The archival (copying of data to tiered storage) process will
> > > > > > > > > have a delayed start. The delayed start of archival could lead to
> > > > > > > > > local build up of data which may lead to disk full.
> > > > > > > > >
> > > > > > > > > The disadvantage of adding this new API is that every provider
> > > > > > > > > will have to implement it, agreed. But I believe that this
> > > > > > > > > tradeoff is worthwhile since the default implementation could be
> > > > > > > > > the same as you mentioned, i.e. keeping cumulative in-memory count.
> > > > > > > >
> > > > > > > > --
> > > > > > > > Divij Vaidya
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Sun, Jun 4, 2023 at 5:48 PM Kamal Chandraprakash <
> > > > > > > > kamal.chandraprakash@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Divij,
> > > > > > > > >
> > > > > > > > > Thanks for the KIP! Sorry for the late reply.
> > > > > > > > >
> > > > > > > > > > Can you explain the rejected alternative-3?
> > > > > > > > > > Store the cumulative size of remote tier log in-memory at
> > > > > > > > > > RemoteLogManager
> > > > > > > > > > "*Cons*: Every time a broker starts-up, it will scan through all
> > > > > > > > > > the segments in the remote tier to initialise the in-memory
> > > > > > > > > > value. This would increase the broker start-up time."
> > > > > > > > > >
> > > > > > > > > > Keeping the source of truth to determine the remote-log-size in
> > > > > > > > > > the leader would be consistent across different implementations
> > > > > > > > > > of the plugin. The concern posted in the KIP is that we are
> > > > > > > > > > calculating the remote-log-size on each iteration of the cleaner
> > > > > > > > > > thread (say 5 mins). If we calculate only once during broker
> > > > > > > > > > startup or during the leadership reassignment, do we still need
> > > > > > > > > > the cache?
> > > > > > > > > >
> > > > > > > > > > The broker startup-time won't be affected by the remote log
> > > > > > > > > > manager initialisation. The broker continues to accept new
> > > > > > > > > > produce/fetch requests, while the RLM thread in the background
> > > > > > > > > > can determine the remote-log-size once and start
> > > > > > > > > > copying/deleting the segments.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Kamal
> > > > > > > > >
> > > > > > > > > > On Thu, Jun 1, 2023 at 2:08 PM Divij Vaidya <
> > > > > > > > > > divijvaidya13@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Satish / Jun
> > > > > > > > > >
> > > > > > > > > > Do you have any thoughts on this?
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Divij Vaidya
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > On Tue, Feb 14, 2023 at 4:15 PM Divij Vaidya <
> > > > > > > > > > > divijvaidya13@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hey Jun
> > > > > > > > > > >
> > > > > > > > > > > > It has been a while since this KIP got some attention. While we
> > > > > > > > > > > > wait for Satish to chime in here, perhaps I can answer your
> > > > > > > > > > > > question.
> > > > > > > > > > > >
> > > > > > > > > > > > > Could you explain how you exposed the log size in your KIP-405
> > > > > > > > > > > > > implementation?
> > > > > > > > > > > >
> > > > > > > > > > > > The APIs available in RLMM as per KIP405 are,
> > > > > > > > > > > > addRemoteLogSegmentMetadata(), updateRemoteLogSegmentMetadata(),
> > > > > > > > > > > > remoteLogSegmentMetadata(), highestOffsetForEpoch(),
> > > > > > > > > > > > putRemotePartitionDeleteMetadata(), listRemoteLogSegments(),
> > > > > > > > > > > > onPartitionLeadershipChanges() and onStopPartitions(). None of
> > > > > > > > > > > > these APIs allow us to expose the log size, hence, the only
> > > > > > > > > > > > option that remains is to list all segments using
> > > > > > > > > > > > listRemoteLogSegments() and aggregate them every time we require
> > > > > > > > > > > > to calculate the size. Based on our prior discussion, this
> > > > > > > > > > > > requires reading all segment metadata which won't work for
> > > > > > > > > > > > non-local RLMM implementations. Satish's implementation also
> > > > > > > > > > > > performs a full scan and calculates the aggregate. see:
> > > > > > > > > > > > https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/remote/RemoteLogManager.scala#L619
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Does this answer your question?
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Divij Vaidya
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > On Tue, Dec 20, 2022 at 8:40 PM Jun Rao
> > > > > > > > > > > > <jun@confluent.io.invalid> wrote:
> > > > > > > > > > >
> > > > > > > > > > >> Hi, Divij,
> > > > > > > > > > >>
> > > > > > > > > > >> Thanks for the explanation.
> > > > > > > > > > >>
> > > > > > > > > > >> Good question.
> > > > > > > > > > >>
> > > > > > > > > > >> Hi, Satish,
> > > > > > > > > > >>
> > > > > > > > > > > >> Could you explain how you exposed the log size in your KIP-405
> > > > > > > > > > > >> implementation?
> > > > > > > > > > >>
> > > > > > > > > > >> Thanks,
> > > > > > > > > > >>
> > > > > > > > > > >> Jun
> > > > > > > > > > >>
> > > > > > > > > > > >> On Tue, Dec 20, 2022 at 4:59 AM Divij Vaidya <
> > > > > > > > > > > >> divijvaidya13@gmail.com> wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> > Hey Jun
> > > > > > > > > > >> >
> > > > > > > > > > > >> > Yes, it is possible to maintain the log size in the cache (see rejected
> > > > > > > > > > > >> > alternative#3 in the KIP) but I did not understand how it is possible to
> > > > > > > > > > > >> > retrieve it without the new API. The log size could be calculated on
> > > > > > > > > > > >> > startup by scanning through the segments (though I would disagree that
> > > > > > > > > > > >> > this is the right approach since scanning itself takes on the order of
> > > > > > > > > > > >> > minutes and hence delays the start of the archive process), and
> > > > > > > > > > > >> > incrementally maintained afterwards. Even then, we would need an API in
> > > > > > > > > > > >> > RemoteLogMetadataManager so that RLM could fetch the cached size!
> > > > > > > > > > >> >
> > > > > > > > > > > >> > If we wish to cache the size without adding a new API, then we need to
> > > > > > > > > > > >> > cache the size in RLM itself (instead of the RLMM implementation) and
> > > > > > > > > > > >> > incrementally manage it. The downside of longer archive time at startup
> > > > > > > > > > > >> > (due to the initial scan) still remains valid in this situation.
> > > > > > > > > > >> >
> > > > > > > > > > >> > --
> > > > > > > > > > >> > Divij Vaidya
> > > > > > > > > > >> >
> > > > > > > > > > >> >
> > > > > > > > > > >> >
> > > > > > > > > > > >> > On Fri, Dec 16, 2022 at 12:43 AM Jun Rao <jun@confluent.io.invalid> wrote:
> > > > > > > > > > >> >
> > > > > > > > > > >> > > Hi, Divij,
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > Thanks for the explanation.
> > > > > > > > > > >> > >
> > > > > > > > > > > >> > > If there is an in-memory cache, could we maintain the log size in the
> > > > > > > > > > > >> > > cache with the existing API? For example, a replica could make a
> > > > > > > > > > > >> > > listRemoteLogSegments(TopicIdPartition topicIdPartition) call on startup
> > > > > > > > > > > >> > > to get the remote segment size before the current leaderEpoch. The leader
> > > > > > > > > > > >> > > could then maintain the size incrementally afterwards. On leader change,
> > > > > > > > > > > >> > > other replicas can make a listRemoteLogSegments(TopicIdPartition
> > > > > > > > > > > >> > > topicIdPartition, int leaderEpoch) call to get the size of newly
> > > > > > > > > > > >> > > generated segments.
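
The approach described in the quoted paragraph above (one full listing on startup, incremental updates by the leader, and a per-epoch listing on leader change) could be sketched roughly as follows. This is only an illustration: RemoteLogSizeTracker, SegmentInfo and all method names are hypothetical and are not part of KIP-405 or of Kafka. The lists passed in stand for whatever the two listRemoteLogSegments() calls would return.

```java
import java.util.List;

/** Hypothetical sketch of a broker-side size cache; not a Kafka class. */
public class RemoteLogSizeTracker {

    /** Hypothetical stand-in for remote segment metadata (only what the sketch needs). */
    public static final class SegmentInfo {
        public final int leaderEpoch;
        public final long sizeInBytes;

        public SegmentInfo(int leaderEpoch, long sizeInBytes) {
            this.leaderEpoch = leaderEpoch;
            this.sizeInBytes = sizeInBytes;
        }
    }

    private long remoteLogSizeBytes;

    /** Startup: one full scan over the segments listed for the partition. */
    public void onStartup(List<SegmentInfo> allSegments) {
        remoteLogSizeBytes = sum(allSegments);
    }

    /** Leader copied a segment to remote storage: grow the cached size. */
    public void onSegmentCopied(SegmentInfo segment) {
        remoteLogSizeBytes += segment.sizeInBytes;
    }

    /** Leader deleted a remote segment (e.g. retention): shrink the cached size. */
    public void onSegmentDeleted(SegmentInfo segment) {
        remoteLogSizeBytes -= segment.sizeInBytes;
    }

    /** Leader change: add only the segments listed for the newer leader epoch. */
    public void onLeaderChange(List<SegmentInfo> segmentsOfNewerEpoch) {
        remoteLogSizeBytes += sum(segmentsOfNewerEpoch);
    }

    public long remoteLogSizeBytes() {
        return remoteLogSizeBytes;
    }

    private static long sum(List<SegmentInfo> segments) {
        long total = 0L;
        for (SegmentInfo s : segments) {
            total += s.sizeInBytes;
        }
        return total;
    }
}
```

Note that the startup step is still a full scan, which is the cost discussed elsewhere in this thread; the sketch only avoids repeating that scan on every size computation.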
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > Thanks,
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > Jun
> > > > > > > > > > >> > >
> > > > > > > > > > >> > >
> > > > > > > > > > > >> > > On Wed, Dec 14, 2022 at 3:27 AM Divij Vaidya <divijvaidya13@gmail.com> wrote:
> > > > > > > > > > >> > >
> > > > > > > > > > > >> > > > > Is the new method enough for doing size-based retention?
> > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > Yes. You are right in assuming that this API only provides the Remote
> > > > > > > > > > > >> > > > storage size (for current epoch chain). We would use this API for size
> > > > > > > > > > > >> > > > based retention along with a value of localOnlyLogSegmentSize which is
> > > > > > > > > > > >> > > > computed as Log.sizeInBytes(logSegments.filter(_.baseOffset >
> > > > > > > > > > > >> > > > highestOffsetWithRemoteIndex)). Hence, (total_log_size =
> > > > > > > > > > > >> > > > remoteLogSizeBytes + log.localOnlyLogSegmentSize). I have updated the KIP
> > > > > > > > > > > >> > > > with this information. You can also check an example implementation at
> > > > > > > > > > > >> > > > https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/Log.scala#L2077
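
The size formula quoted above (total_log_size = remoteLogSizeBytes + localOnlyLogSegmentSize, where only local segments with a base offset above highestOffsetWithRemoteIndex are counted) can be illustrated with a small sketch. The TotalLogSize and LocalSegment names are hypothetical and only mirror the quoted Scala expression; they are not the Kafka classes linked above.

```java
import java.util.List;

/** Hypothetical illustration of the total log size formula from the thread. */
public class TotalLogSize {

    /** Hypothetical stand-in for a local log segment. */
    public static final class LocalSegment {
        public final long baseOffset;
        public final long sizeInBytes;

        public LocalSegment(long baseOffset, long sizeInBytes) {
            this.baseOffset = baseOffset;
            this.sizeInBytes = sizeInBytes;
        }
    }

    /** Sums only local segments that start after the highest offset already in remote storage. */
    public static long localOnlyLogSegmentSize(List<LocalSegment> logSegments,
                                               long highestOffsetWithRemoteIndex) {
        long total = 0L;
        for (LocalSegment segment : logSegments) {
            if (segment.baseOffset > highestOffsetWithRemoteIndex) {
                total += segment.sizeInBytes;
            }
        }
        return total;
    }

    /** total_log_size = remoteLogSizeBytes + localOnlyLogSegmentSize */
    public static long totalLogSize(long remoteLogSizeBytes,
                                    List<LocalSegment> logSegments,
                                    long highestOffsetWithRemoteIndex) {
        return remoteLogSizeBytes
                + localOnlyLogSegmentSize(logSegments, highestOffsetWithRemoteIndex);
    }
}
```

Filtering on the base offset is what keeps segments that exist both locally and remotely from being counted twice.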
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > > Do you imagine all accesses to remote metadata will be across the
> > > > > > > > > > > >> > > > > network or will there be some local in-memory cache?
> > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > I would expect a disk-less implementation to maintain a finite in-memory
> > > > > > > > > > > >> > > > cache for segment metadata to optimize the number of network calls made
> > > > > > > > > > > >> > > > to fetch the data. In future, we can think about bringing this finite
> > > > > > > > > > > >> > > > size cache into RLM itself but that's probably a conversation for a
> > > > > > > > > > > >> > > > different KIP. There are many other things we would like to do to
> > > > > > > > > > > >> > > > optimize the Tiered storage interface such as introducing a circular
> > > > > > > > > > > >> > > > buffer / streaming interface from RSM (so that we don't have to wait to
> > > > > > > > > > > >> > > > fetch the entire segment before starting to send records to the
> > > > > > > > > > > >> > > > consumer), caching the segments fetched from RSM locally (I would assume
> > > > > > > > > > > >> > > > all RSM plugin implementations to do this, might as well add it to RLM)
> > > > > > > > > > > >> > > > etc.
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > --
> > > > > > > > > > >> > > > Divij Vaidya
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > On Mon, Dec 12, 2022 at 7:35 PM Jun Rao <jun@confluent.io.invalid> wrote:
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > > Hi, Divij,
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > Thanks for the reply.
> > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > Is the new method enough for doing size-based retention? It gives the
> > > > > > > > > > > >> > > > > total size of the remote segments, but it seems that we still don't know
> > > > > > > > > > > >> > > > > the exact total size for a log since there could be overlapping segments
> > > > > > > > > > > >> > > > > between the remote and the local segments.
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > You mentioned a disk-less implementation. Do you imagine all accesses to
> > > > > > > > > > > >> > > > > remote metadata will be across the network or will there be some local
> > > > > > > > > > > >> > > > > in-memory cache?
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > Thanks,
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > Jun
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > On Wed, Dec 7, 2022 at 3:10 AM Divij Vaidya <divijvaidya13@gmail.com> wrote:
> > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > > The method is needed for RLMM implementations which fetch the
> > > > > > > > > > > >> > > > > > information over the network and not for the disk based implementations
> > > > > > > > > > > >> > > > > > (such as the default topic based RLMM).
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > I would argue that adding this API makes the interface more generic than
> > > > > > > > > > > >> > > > > > what it is today. This is because, with the current APIs an implementor
> > > > > > > > > > > >> > > > > > is restricted to use disk based RLMM solutions only (i.e. the default
> > > > > > > > > > > >> > > > > > solution) whereas if we add this new API, we unblock usage of network
> > > > > > > > > > > >> > > > > > based RLMM implementations such as databases.
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > On Wed 30. Nov 2022 at 20:40, Jun Rao <jun@confluent.io.invalid> wrote:
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > > Hi, Divij,
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > Thanks for the reply.
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > Point#2. My high level question is whether the new method is needed
> > > > > > > > > > > >> > > > > > > for every implementation of remote storage or just for a specific
> > > > > > > > > > > >> > > > > > > implementation. The issues that you pointed out exist for the default
> > > > > > > > > > > >> > > > > > > implementation of RLMM as well and so far, the default implementation
> > > > > > > > > > > >> > > > > > > hasn't found a need for a similar new method. For a public interface,
> > > > > > > > > > > >> > > > > > > ideally we want to make it more general.
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > Thanks,
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > Jun
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > On Mon, Nov 21, 2022 at 7:11 AM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > > > > > >> > > > > > > wrote:
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > > Thank you Jun and Alex for your comments.
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > >> > > > > > > > Point#1: You are right Jun. As Alex mentioned, the "derived metadata"
> > > > > > > > > > > >> > > > > > > > can increase the size of cached metadata by a factor of 10 but it
> > > > > > > > > > > >> > > > > > > > should be ok to cache just the actual metadata. My point about size
> > > > > > > > > > > >> > > > > > > > being a limitation for using cache is not valid anymore.
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > >> > > > > > > > Point#2: For a new replica, it would still have to fetch the metadata
> > > > > > > > > > > >> > > > > > > > over the network to initiate the warm up of the cache and hence,
> > > > > > > > > > > >> > > > > > > > increase the start time of the archival process. Please also note the
> > > > > > > > > > > >> > > > > > > > repercussions of the warm up scan that Alex mentioned in this thread
> > > > > > > > > > > >> > > > > > > > as part of #102.2.
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > >> > > > > > > > 100#: Agreed Alex. Thanks for clarifying that. My point about size
> > > > > > > > > > > >> > > > > > > > being a limitation for using cache is not valid anymore.
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > >> > > > > > > > 101#: Alex, if I understand correctly, you are suggesting to cache the
> > > > > > > > > > > >> > > > > > > > total size at the leader and update it on archival. This wouldn't work
> > > > > > > > > > > >> > > > > > > > for cases when the leader restarts where we would have to make a full
> > > > > > > > > > > >> > > > > > > > scan to update the total size entry on startup. We expect users to
> > > > > > > > > > > >> > > > > > > > store data over longer duration in remote storage which increases the
> > > > > > > > > > > >> > > > > > > > likelihood of leader restarts / failovers.
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > >> > > > > > > > 102#.1: I don't think that the current design accommodates the fact
> > > > > > > > > > > >> > > > > > > > that data corruption could happen at the RLMM plugin (we don't have
> > > > > > > > > > > >> > > > > > > > checksum as a field in metadata as part of KIP405). If data corruption
> > > > > > > > > > > >> > > > > > > > occurs, w/ or w/o the cache, it would be a different problem to solve.
> > > > > > > > > > > >> > > > > > > > I would like to keep this outside the scope of this KIP.
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > >> > > > > > > > 102#.2: Agree. This remains as the main concern for using the cache to
> > > > > > > > > > > >> > > > > > > > fetch total size.
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > Regards,
> > > > > > > > > > >> > > > > > > > Divij Vaidya
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > >> > > > > > > > On Fri, Nov 18, 2022 at 12:59 PM Alexandre Dupriez <
> > > > > > > > > > > >> > > > > > > > alexandre.dupriez@gmail.com> wrote:
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > > Hi Divij,
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > Thanks for the KIP. Please find some comments based on what I read on
> > > > > > > > > > > >> > > > > > > > > this thread so far - apologies for the repeats and the late reply.
> > > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > If I understand correctly, one of the main elements of discussion is
> > > > > > > > > > > >> > > > > > > > > about caching in Kafka versus delegation of providing the remote size
> > > > > > > > > > > >> > > > > > > > > of a topic-partition to the plugin.
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > A few comments:
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > 100. The size of the “derived metadata” which is managed by the plugin
> > > > > > > > > > > >> > > > > > > > > to represent an rlmMetadata can indeed be close to 1 kB on average
> > > > > > > > > > > >> > > > > > > > > depending on its own internal structure, e.g. the redundancy it
> > > > > > > > > > > >> > > > > > > > > enforces (unfortunately resulting in duplication), additional
> > > > > > > > > > > >> > > > > > > > > information such as checksums and primary and secondary indexable
> > > > > > > > > > > >> > > > > > > > > keys. But indeed, the rlmMetadata is itself a lighter data structure
> > > > > > > > > > > >> > > > > > > > > by a factor of 10. And indeed, instead of caching the “derived
> > > > > > > > > > > >> > > > > > > > > metadata”, only the rlmMetadata could be, which should address the
> > > > > > > > > > > >> > > > > > > > > concern regarding the memory occupancy of the cache.
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > 101. I am not sure I fully understand why we would need to cache the
> > > > > > > > > > > >> > > > > > > > > list of rlmMetadata to retain the remote size of a topic-partition.
> > > > > > > > > > > >> > > > > > > > > Since the leader of a topic-partition is, in non-degenerated cases,
> > > > > > > > > > > >> > > > > > > > > the only actor which can mutate the remote part of the
> > > > > > > > > > > >> > > > > > > > > topic-partition, hence its size, it could in theory only cache the
> > > > > > > > > > > >> > > > > > > > > size of the remote log once it has calculated it? In which case there
> > > > > > > > > > > >> > > > > > > > > would not be any problem regarding the size of the caching strategy.
> > > > > > > > > > > >> > > > > > > > > Did I miss something there?
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > 102. There may be a few challenges to consider with caching:
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > 102.1) As mentioned above, the caching strategy assumes no mutation
> > > > > > > > > > > >> > > > > > > > > outside the lifetime of a leader. While this is true in the normal
> > > > > > > > > > > >> > > > > > > > > course of operation, there could be accidental mutation outside of
> > > > > > > > > > > >> > > > > > > > > the leader and a loss of consistency between the cached state and the
> > > > > > > > > > > >> > > > > > > > > actual remote representation of the log. E.g. split-brain scenarios,
> > > > > > > > > > > >> > > > > > > > > bugs in the plugins, bugs in external systems with mutating access on
> > > > > > > > > > > >> > > > > > > > > the derived metadata. In the worst case, a drift between the cached
> > > > > > > > > > > >> > > > > > > > > size and the actual size could lead to over-deleting remote data
> > > > > > > > > > > >> > > > > > > > > which is a durability risk.
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > The alternative you propose, by making the plugin the source of truth
> > > > > > > > > > > >> > > > > > > > > w.r.t. the size of the remote log, can make it easier to avoid
> > > > > > > > > > > >> > > > > > > > > inconsistencies between plugin-managed metadata and the remote log
> > > > > > > > > > > >> > > > > > > > > from the perspective of Kafka. On the other hand, plugin vendors
> > > > > > > > > > > >> > > > > > > > > would have to implement it with the expected efficiency to have it
> > > > > > > > > > > >> > > > > > > > > yield benefits.
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > 102.2) As you mentioned, the caching strategy in Kafka would still
> > > > > > > > > > > >> > > > > > > > > require one iteration over the list of rlmMetadata when the
> > > > > > > > > > > >> > > > > > > > > leadership of a topic-partition is assigned to a broker, while the
> > > > > > > > > > > >> > > > > > > > > plugin can offer alternative constant-time approaches. This
> > > > > > > > > > > >> > > > > > > > > calculation cannot be put on the LeaderAndIsr path and would be
> > > > > > > > > > > >> > > > > > > > > performed in the background. In case of bulk leadership migration,
> > > > > > > > > > > >> > > > > > > > > listing the rlmMetadata could a) result in request bursts to any
> > > > > > > > > > > >> > > > > > > > > backend system the plugin may use [which shouldn’t be a problem for
> > > > > > > > > > > >> > > > > > > > > high-throughput data stores but could have cost implications] b)
> > > > > > > > > > > >> > > > > > > > > increase utilisation timespan of the RLM threads for these
> > > > > > > > > > > >> > > > > > > > > calculations potentially leading to transient starvation of tasks
> > > > > > > > > > > >> > > > > > > > > queued for, typically, offloading operations c) could have a
> > > > > > > > > > > >> > > > > > > > > non-marginal CPU footprint on hardware with strict resource
> > > > > > > > > > > >> > > > > > > > > constraints. All these elements could have an impact to some degree
> > > > > > > > > > > >> > > > > > > > > depending on the operational environment.
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > From a design perspective, one question is where we want the source
> > > > > > > > > > > >> > > > > > > > > of truth w.r.t. remote log size to be during the lifetime of a
> > > > > > > > > > > >> > > > > > > > > leader. The responsibility of maintaining a consistent representation
> > > > > > > > > > > >> > > > > > > > > of the remote log is shared by Kafka and the plugin. Which system is
> > > > > > > > > > > >> > > > > > > > > best placed to maintain such a state while providing the highest
> > > > > > > > > > > >> > > > > > > > > consistency guarantees is something both Kafka and plugin designers
> > > > > > > > > > > >> > > > > > > > > could help understand better.
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > Many thanks,
> > > > > > > > > > >> > > > > > > > > Alexandre
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > On Thu, Nov 17, 2022 at 19:27, Jun Rao <jun@confluent.io.invalid>
> > > > > > > > > > > >> > > > > > > > > wrote:
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > Hi, Divij,
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > Thanks for the reply.
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > Point #1. Is the average remote segment metadata really 1KB? What's
> > > > > > > > > > > >> > > > > > > > > > listed in the public interface is probably well below 100 bytes.
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > Point #2. I guess you are assuming that each broker only caches the
> > > > > > > > > > > >> > > > > > > > > > remote segment metadata in memory. An alternative approach is to
> > > > > > > > > > > >> > > > > > > > > > cache them in both memory and local disk. That way, on broker
> > > > > > > > > > > >> > > > > > > > > > restart, you just need to fetch the new remote segments' metadata
> > > > > > > > > > > >> > > > > > > > > > using the listRemoteLogSegments(TopicIdPartition topicIdPartition,
> > > > > > > > > > > >> > > > > > > > > > int leaderEpoch) api. Will that work?
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > Point #3. Thanks for the explanation and it sounds good.
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > Thanks,
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > Jun
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > On Thu, Nov 17, 2022 at 7:31 AM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > > > > > >> > > > > > > > > > wrote:
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > Hi Jun
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > > There are three points that I would like to present here:
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > > 1. We would require a large cache size to efficiently cache all
> > > > > > > > > > > >> > > > > > > > > > > segment metadata.
> > > > > > > > > > > >> > > > > > > > > > > 2. Linear scan of all metadata at broker startup to populate the
> > > > > > > > > > > >> > > > > > > > > > > cache will be slow and will impact the archival process.
> > > > > > > > > > > >> > > > > > > > > > > 3. There is no other use case where a full scan of segment metadata
> > > > > > > > > > > >> > > > > > > > > > > is required.
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > > Let's start by quantifying 1. Here's my estimate for the size of the
> > > > > > > > > > > >> > > > > > > > > > > cache.
> > > > > > > > > > > >> > > > > > > > > > > Average size of segment metadata = 1KB. This could be more if we
> > > > > > > > > > > >> > > > > > > > > > > have frequent leader failover with a large number of leader epochs
> > > > > > > > > > > >> > > > > > > > > > > being stored per segment.
> > > > > > > > > > > >> > > > > > > > > > > Segment size = 100MB. Users will prefer to reduce the segment size
> > > > > > > > > > > >> > > > > > > > > > > from the default value of 1GB to ensure timely archival of data
> > > > > > > > > > > >> > > > > > > > > > > since data from active segment is not archived.
> > > > > > > > > > > >> > > > > > > > > > > Cache size = num segments * avg. segment metadata size =
> > > > > > > > > > > >> > > > > > > > > > > (100TB/100MB)*1KB = 1GB.
> > > > > > > > > > >> > > > > > > > > > > While 1GB for cache may not sound
> > > like a
> > > > > > large
> > > > > > > > > > number
> > > > > > > > > > >> for
> > > > > > > > > > >> > > > > larger
> > > > > > > > > > >> > > > > > > > > machines,
> > > > > > > > > > >> > > > > > > > > > > it does eat into the memory as an
> > > > > additional
> > > > > > > > cache
> > > > > > > > > > and
> > > > > > > > > > >> > > makes
> > > > > > > > > > >> > > > > use
> > > > > > > > > > >> > > > > > > > cases
> > > > > > > > > > >> > > > > > > > > with
> > > > > > > > > > >> > > > > > > > > > > large data retention with low
> > > throughput
> > > > > > > > expensive
> > > > > > > > > > >> (where
> > > > > > > > > > >> > > > such
> > > > > > > > > > >> > > > > > use
> > > > > > > > > > >> > > > > > > > case
> > > > > > > > > > > >> > > > > > > > > > > could use smaller machines).
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > About point#2:
> > > > > > > > > > >> > > > > > > > > > > Even if we say that all segment
> > > metadata
> > > > > > can fit
> > > > > > > > > > into
> > > > > > > > > > >> the
> > > > > > > > > > >> > > > > cache,
> > > > > > > > > > >> > > > > > we
> > > > > > > > > > >> > > > > > > > > will
> > > > > > > > > > >> > > > > > > > > > > need to populate the cache on
> broker
> > > > > > startup. It
> > > > > > > > > > would
> > > > > > > > > > >> > not
> > > > > > > > > > >> > > be
> > > > > > > > > > >> > > > > in
> > > > > > > > > > >> > > > > > > the
> > > > > > > > > > > >> > > > > > > > > > > critical path of broker startup
> and
> > > hence
> > > > > > won't
> > > > > > > > > > >> impact
> > > > > > > > > > >> > the
> > > > > > > > > > >> > > > > > startup
> > > > > > > > > > >> > > > > > > > > time.
> > > > > > > > > > >> > > > > > > > > > > But it will impact the time when
> we
> > > could
> > > > > > start
> > > > > > > > > the
> > > > > > > > > > >> > > archival
> > > > > > > > > > >> > > > > > > process
> > > > > > > > > > >> > > > > > > > > since
> > > > > > > > > > >> > > > > > > > > > > the RLM thread pool will be
> blocked
> > > on the
> > > > > > first
> > > > > > > > > > call
> > > > > > > > > > >> to
> > > > > > > > > > >> > > > > > > > > > > listRemoteLogSegments(). To scan
> > > metadata
> > > > > > for
> > > > > > > > 1MM
> > > > > > > > > > >> > segments
> > > > > > > > > > >> > > > > > > (computed
> > > > > > > > > > >> > > > > > > > > above)
> > > > > > > > > > >> > > > > > > > > > > and transfer 1GB data over the
> > network
> > > > > from
> > > > > > a
> > > > > > > > RLMM
> > > > > > > > > > >> such
> > > > > > > > > > >> > as
> > > > > > > > > > >> > > a
> > > > > > > > > > >> > > > > > remote
> > > > > > > > > > >> > > > > > > > > > > database would be in the order of
> > > minutes
> > > > > > > > > (depending
> > > > > > > > > > >> on
> > > > > > > > > > >> > how
> > > > > > > > > > >> > > > > > > efficient
> > > > > > > > > > >> > > > > > > > > the
> > > > > > > > > > >> > > > > > > > > > > scan is with the RLMM
> > implementation).
> > > > > > > > Although, I
> > > > > > > > > > >> would
> > > > > > > > > > >> > > > > concede
> > > > > > > > > > >> > > > > > > that
> > > > > > > > > > >> > > > > > > > > > > having RLM threads blocked for a
> few
> > > > > > minutes is
> > > > > > > > > > >> perhaps
> > > > > > > > > > >> > OK
> > > > > > > > > > >> > > > but
> > > > > > > > > > >> > > > > if
> > > > > > > > > > >> > > > > > > we
> > > > > > > > > > >> > > > > > > > > > > introduce the new API proposed in
> > the
> > > KIP,
> > > > > > we
> > > > > > > > > would
> > > > > > > > > > >> have
> > > > > > > > > > >> > a
> > > > > > > > > > >> > > > > > > > > > > deterministic startup time for
> RLM.
> > > Adding
> > > > > > the
> > > > > > > > API
> > > > > > > > > > >> comes
> > > > > > > > > > >> > > at a
> > > > > > > > > > >> > > > > low
> > > > > > > > > > >> > > > > > > > cost
> > > > > > > > > > >> > > > > > > > > and
> > > > > > > > > > >> > > > > > > > > > > I believe the trade off is worth
> it.
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > About point#3:
> > > > > > > > > > >> > > > > > > > > > > We can use
> > > > > > > > listRemoteLogSegments(TopicIdPartition
> > > > > > > > > > >> > > > > > topicIdPartition,
> > > > > > > > > > >> > > > > > > > int
> > > > > > > > > > >> > > > > > > > > > > leaderEpoch) to calculate the
> > segments
> > > > > > eligible
> > > > > > > > > for
> > > > > > > > > > >> > > deletion
> > > > > > > > > > >> > > > > > (based
> > > > > > > > > > >> > > > > > > > on
> > > > > > > > > > >> > > > > > > > > size
> > > > > > > > > > >> > > > > > > > > > > retention) where leader epoch(s)
> > > belong to
> > > > > > the
> > > > > > > > > > current
> > > > > > > > > > >> > > leader
> > > > > > > > > > >> > > > > > epoch
> > > > > > > > > > >> > > > > > > > > chain.
> > > > > > > > > > >> > > > > > > > > > > I understand that it may lead to
> > > segments
> > > > > > > > > belonging
> > > > > > > > > > to
> > > > > > > > > > >> > > other
> > > > > > > > > > >> > > > > > epoch
> > > > > > > > > > >> > > > > > > > > lineage
> > > > > > > > > > >> > > > > > > > > > > not getting deleted and would
> > require
> > > a
> > > > > > separate
> > > > > > > > > > >> > mechanism
> > > > > > > > > > >> > > to
> > > > > > > > > > >> > > > > > > delete
> > > > > > > > > > >> > > > > > > > > them.
> > > > > > > > > > >> > > > > > > > > > > The separate mechanism would
> anyways
> > > be
> > > > > > required
> > > > > > > > > to
> > > > > > > > > > >> > delete
> > > > > > > > > > >> > > > > these
> > > > > > > > > > >> > > > > > > > > "leaked"
> > > > > > > > > > >> > > > > > > > > > > segments as there are other cases
> > > which
> > > > > > could
> > > > > > > > lead
> > > > > > > > > > to
> > > > > > > > > > >> > leaks
> > > > > > > > > > >> > > > > such
> > > > > > > > > > >> > > > > > as
> > > > > > > > > > >> > > > > > > > > network
> > > > > > > > > > > >> > > > > > > > > > > problems with RSM midway through
> > > writing
> > > > > > > > > a segment,
> > > > > > > > > > > >> etc.
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > Thank you for the replies so far.
> > They
> > > > > have
> > > > > > made
> > > > > > > > > me
> > > > > > > > > > >> > > re-think
> > > > > > > > > > >> > > > my
> > > > > > > > > > >> > > > > > > > > assumptions
> > > > > > > > > > >> > > > > > > > > > > and this dialogue has been very
> > > > > > constructive for
> > > > > > > > > me.
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > Regards,
> > > > > > > > > > >> > > > > > > > > > > Divij Vaidya
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > On Thu, Nov 10, 2022 at 10:49 PM
> Jun
> > > Rao
> > > > > > > > > > >> > > > > > <jun@confluent.io.invalid
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > > wrote:
> > > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > Hi, Divij,
> > > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > Thanks for the reply.
> > > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > It's true that the data in Kafka
> > > could
> > > > > be
> > > > > > kept
> > > > > > > > > > >> longer
> > > > > > > > > > >> > > with
> > > > > > > > > > >> > > > > > > KIP-405.
> > > > > > > > > > >> > > > > > > > > How
> > > > > > > > > > >> > > > > > > > > > > > much data do you envision to
> have
> > > per
> > > > > > broker?
> > > > > > > > > For
> > > > > > > > > > >> 100TB
> > > > > > > > > > >> > > > data
> > > > > > > > > > >> > > > > > per
> > > > > > > > > > >> > > > > > > > > broker,
> > > > > > > > > > >> > > > > > > > > > > > with 1GB segment and segment
> > > metadata of
> > > > > > 100
> > > > > > > > > > bytes,
> > > > > > > > > > >> it
> > > > > > > > > > >> > > > > requires
> > > > > > > > > > >> > > > > > > > > > > > 100TB/1GB*100 = 10MB, which
> should
> > > fit
> > > > > in
> > > > > > > > > memory.
> > > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > RemoteLogMetadataManager has two
> > > > > > > > > > >> > listRemoteLogSegments()
> > > > > > > > > > >> > > > > > methods.
> > > > > > > > > > >> > > > > > > > > The one
> > > > > > > > > > >> > > > > > > > > > > > you listed
> > > > > > > > > listRemoteLogSegments(TopicIdPartition
> > > > > > > > > > >> > > > > > > topicIdPartition,
> > > > > > > > > > >> > > > > > > > > int
> > > > > > > > > > >> > > > > > > > > > > > leaderEpoch) does return data in
> > > offset
> > > > > > order.
> > > > > > > > > > >> However,
> > > > > > > > > > >> > > the
> > > > > > > > > > >> > > > > > other
> > > > > > > > > > >> > > > > > > > > > > > one
> > > > > listRemoteLogSegments(TopicIdPartition
> > > > > > > > > > >> > > > topicIdPartition)
> > > > > > > > > > >> > > > > > > > doesn't
> > > > > > > > > > >> > > > > > > > > > > > specify the return order. I
> assume
> > > that
> > > > > > you
> > > > > > > > need
> > > > > > > > > > the
> > > > > > > > > > >> > > latter
> > > > > > > > > > >> > > > > to
> > > > > > > > > > >> > > > > > > > > calculate
> > > > > > > > > > >> > > > > > > > > > > > the segment size?
> > > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > Thanks,
> > > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > Jun
> > > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > On Thu, Nov 10, 2022 at 10:25 AM
> > > Divij
> > > > > > Vaidya
> > > > > > > > <
> > > > > > > > > > >> > > > > > > > > divijvaidya13@gmail.com>
> > > > > > > > > > >> > > > > > > > > > > > wrote:
> > > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > *Jun,*
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > *"the default implementation
> of
> > > RLMM
> > > > > > does
> > > > > > > > > local
> > > > > > > > > > >> > > caching,
> > > > > > > > > > >> > > > > > > right?"*
> > > > > > > > > > >> > > > > > > > > > > > > Yes, Jun. The default
> > > implementation
> > > > > of
> > > > > > RLMM
> > > > > > > > > > does
> > > > > > > > > > >> > > indeed
> > > > > > > > > > >> > > > > > cache
> > > > > > > > > > >> > > > > > > > the
> > > > > > > > > > >> > > > > > > > > > > > segment
> > > > > > > > > > >> > > > > > > > > > > > > metadata today, hence, it
> won't
> > > work
> > > > > > for use
> > > > > > > > > > cases
> > > > > > > > > > >> > when
> > > > > > > > > > >> > > > the
> > > > > > > > > > >> > > > > > > > number
> > > > > > > > > > >> > > > > > > > > of
> > > > > > > > > > >> > > > > > > > > > > > > segments in remote storage is
> > > large
> > > > > > enough
> > > > > > > > to
> > > > > > > > > > >> exceed
> > > > > > > > > > >> > > the
> > > > > > > > > > >> > > > > size
> > > > > > > > > > >> > > > > > > of
> > > > > > > > > > >> > > > > > > > > cache.
> > > > > > > > > > >> > > > > > > > > > > > As
> > > > > > > > > > >> > > > > > > > > > > > > part of this KIP, I will
> > > implement the
> > > > > > new
> > > > > > > > > > >> proposed
> > > > > > > > > > >> > API
> > > > > > > > > > >> > > > in
> > > > > > > > > > >> > > > > > the
> > > > > > > > > > >> > > > > > > > > default
> > > > > > > > > > >> > > > > > > > > > > > > implementation of RLMM but the
> > > > > > underlying
> > > > > > > > > > >> > > implementation
> > > > > > > > > > >> > > > > will
> > > > > > > > > > >> > > > > > > > > still be
> > > > > > > > > > >> > > > > > > > > > > a
> > > > > > > > > > >> > > > > > > > > > > > > scan. I will pick up
> optimizing
> > > that
> > > > > in
> > > > > > a
> > > > > > > > > > separate
> > > > > > > > > > >> > PR.
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > *"we also cache all segment
> > > metadata
> > > > > in
> > > > > > the
> > > > > > > > > > >> brokers
> > > > > > > > > > >> > > > without
> > > > > > > > > > >> > > > > > > > > KIP-405. Do
> > > > > > > > > > >> > > > > > > > > > > > you
> > > > > > > > > > >> > > > > > > > > > > > > see a need to change that?"*
> > > > > > > > > > >> > > > > > > > > > > > > Please correct me if I am
> wrong
> > > here
> > > > > > but we
> > > > > > > > > > cache
> > > > > > > > > > >> > > > metadata
> > > > > > > > > > >> > > > > > for
> > > > > > > > > > >> > > > > > > > > segments
> > > > > > > > > > >> > > > > > > > > > > > > "residing in local storage".
> The
> > > size
> > > > > > of the
> > > > > > > > > > >> current
> > > > > > > > > > >> > > > cache
> > > > > > > > > > >> > > > > > > works
> > > > > > > > > > >> > > > > > > > > fine
> > > > > > > > > > >> > > > > > > > > > > for
> > > > > > > > > > >> > > > > > > > > > > > > the scale of the number of
> > > segments
> > > > > > that we
> > > > > > > > > > >> expect to
> > > > > > > > > > >> > > > store
> > > > > > > > > > >> > > > > > in
> > > > > > > > > > >> > > > > > > > > local
> > > > > > > > > > >> > > > > > > > > > > > > storage. After KIP-405, that
> > cache
> > > > > will
> > > > > > > > > continue
> > > > > > > > > > >> to
> > > > > > > > > > >> > > store
> > > > > > > > > > >> > > > > > > > metadata
> > > > > > > > > > >> > > > > > > > > for
> > > > > > > > > > >> > > > > > > > > > > > > segments which are residing in
> > > local
> > > > > > storage
> > > > > > > > > and
> > > > > > > > > > >> > hence,
> > > > > > > > > > >> > > > we
> > > > > > > > > > >> > > > > > > don't
> > > > > > > > > > >> > > > > > > > > need
> > > > > > > > > > >> > > > > > > > > > > to
> > > > > > > > > > >> > > > > > > > > > > > > change that. For segments
> which
> > > have
> > > > > > been
> > > > > > > > > > >> offloaded
> > > > > > > > > > >> > to
> > > > > > > > > > >> > > > > remote
> > > > > > > > > > >> > > > > > > > > storage,
> > > > > > > > > > >> > > > > > > > > > > it
> > > > > > > > > > >> > > > > > > > > > > > > would rely on RLMM. Note that
> > the
> > > > > scale
> > > > > > of
> > > > > > > > > data
> > > > > > > > > > >> > stored
> > > > > > > > > > >> > > in
> > > > > > > > > > >> > > > > > RLMM
> > > > > > > > > > >> > > > > > > is
> > > > > > > > > > >> > > > > > > > > > > > different
> > > > > > > > > > >> > > > > > > > > > > > > from local cache because the
> > > number of
> > > > > > > > > segments
> > > > > > > > > > is
> > > > > > > > > > >> > > > expected
> > > > > > > > > > >> > > > > > to
> > > > > > > > > > >> > > > > > > be
> > > > > > > > > > >> > > > > > > > > much
> > > > > > > > > > >> > > > > > > > > > > > > larger than what current
> > > > > implementation
> > > > > > > > stores
> > > > > > > > > > in
> > > > > > > > > > >> > local
> > > > > > > > > > >> > > > > > > storage.
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > 2,3,4:
> > > > > > > > > > >> > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > > > >> > > > > does
> > > > > > > > > > >> > > > > > > > > specify
> > > > > > > > > > >> > > > > > > > > > > the
> > > > > > > > > > >> > > > > > > > > > > > > order i.e. it returns the
> > segments
> > > > > > sorted by
> > > > > > > > > > first
> > > > > > > > > > >> > > offset
> > > > > > > > > > >> > > > > in
> > > > > > > > > > >> > > > > > > > > ascending
> > > > > > > > > > >> > > > > > > > > > > > > order. I am copying the API
> docs
> > > for
> > > > > > KIP-405
> > > > > > > > > > here
> > > > > > > > > > >> for
> > > > > > > > > > >> > > > your
> > > > > > > > > > >> > > > > > > > > reference
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > *Returns iterator of remote
> log
> > > > > segment
> > > > > > > > > > metadata,
> > > > > > > > > > >> > > sorted
> > > > > > > > > > >> > > > by
> > > > > > > > > > >> > > > > > > > {@link
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > RemoteLogSegmentMetadata#startOffset()}
> > > > > > > > > > > >> in ascending
> > > > > > > > > > >> > > order
> > > > > > > > > > >> > > > > > which
> > > > > > > > > > >> > > > > > > > > > > contains
> > > > > > > > > > >> > > > > > > > > > > > > the given leader epoch. This
> is
> > > used
> > > > > by
> > > > > > > > remote
> > > > > > > > > > log
> > > > > > > > > > >> > > > > retention
> > > > > > > > > > >> > > > > > > > > management
> > > > > > > > > > > >> > > > > > > > > > > > > subsystem to fetch the segment
> > > metadata
> > > > > > for a
> > > > > > > > > > given
> > > > > > > > > > >> > > leader
> > > > > > > > > > > >> > > > > > > > > epoch. @param
> > > > > > > > > > >> > > > > > > > > > > > > topicIdPartition topic
> > > partition @param
> > > > > > > > > > >> leaderEpoch
> > > > > > > > > > >> > > > > > leader
> > > > > > > > > > > >> > > > > > > > > > > > > epoch @return
> > > > > > > > > > >> > > > > > > > > > > > > Iterator of remote segments,
> > > sorted by
> > > > > > start
> > > > > > > > > > >> offset
> > > > > > > > > > >> > in
> > > > > > > > > > >> > > > > > > ascending
> > > > > > > > > > >> > > > > > > > > > > order. *
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > *Luke,*
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > 5. Note that we are trying to
> > > optimize
> > > > > > the
> > > > > > > > > > >> efficiency
> > > > > > > > > > >> > > of
> > > > > > > > > > >> > > > > size
> > > > > > > > > > >> > > > > > > > based
> > > > > > > > > > >> > > > > > > > > > > > > retention for remote storage.
> > > KIP-405
> > > > > > does
> > > > > > > > not
> > > > > > > > > > >> > > introduce
> > > > > > > > > > >> > > > a
> > > > > > > > > > >> > > > > > new
> > > > > > > > > > >> > > > > > > > > config
> > > > > > > > > > >> > > > > > > > > > > for
> > > > > > > > > > >> > > > > > > > > > > > > periodically checking remote
> > > similar
> > > > > to
> > > > > > > > > > >> > > > > > > > > > > log.retention.check.interval.ms
> > > > > > > > > > >> > > > > > > > > > > > > which is applicable for remote
> > > > > storage.
> > > > > > > > Hence,
> > > > > > > > > > the
> > > > > > > > > > >> > > metric
> > > > > > > > > > >> > > > > > will
> > > > > > > > > > >> > > > > > > be
> > > > > > > > > > >> > > > > > > > > > > updated
> > > > > > > > > > >> > > > > > > > > > > > > at the time of invoking log
> > > retention
> > > > > > check
> > > > > > > > > for
> > > > > > > > > > >> > remote
> > > > > > > > > > >> > > > tier
> > > > > > > > > > >> > > > > > > which
> > > > > > > > > > >> > > > > > > > > is
> > > > > > > > > > >> > > > > > > > > > > > > pending implementation today.
> We
> > > can
> > > > > > perhaps
> > > > > > > > > > come
> > > > > > > > > > >> > back
> > > > > > > > > > >> > > > and
> > > > > > > > > > >> > > > > > > update
> > > > > > > > > > >> > > > > > > > > the
> > > > > > > > > > >> > > > > > > > > > > > > metric description after the
> > > > > > implementation
> > > > > > > > of
> > > > > > > > > > log
> > > > > > > > > > >> > > > > retention
> > > > > > > > > > >> > > > > > > > check
> > > > > > > > > > >> > > > > > > > > in
> > > > > > > > > > >> > > > > > > > > > > > > RemoteLogManager.
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > --
> > > > > > > > > > >> > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > On Thu, Nov 10, 2022 at 6:16
> AM
> > > Luke
> > > > > > Chen <
> > > > > > > > > > >> > > > > showuon@gmail.com
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > > > wrote:
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > Hi Divij,
> > > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > One more question about the
> > > metric:
> > > > > > > > > > >> > > > > > > > > > > > > > I think the metric will be
> > > updated
> > > > > > when
> > > > > > > > > > >> > > > > > > > > > > > > > (1) each time we run the log
> > > > > retention
> > > > > > > > check
> > > > > > > > > > >> (that
> > > > > > > > > > >> > > is,
> > > > > > > > > > >> > > > > > > > > > > > > >
> > log.retention.check.interval.ms
> > > )
> > > > > > > > > > >> > > > > > > > > > > > > > (2) When user explicitly
> calls
> > > > > > > > > getRemoteLogSize
> > > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > Is that correct?
> > > > > > > > > > >> > > > > > > > > > > > > > Maybe we should add a note
> in
> > > metric
> > > > > > > > > > >> description,
> > > > > > > > > > >> > > > > > otherwise,
> > > > > > > > > > >> > > > > > > > when
> > > > > > > > > > >> > > > > > > > > > > user
> > > > > > > > > > >> > > > > > > > > > > > > got,
> > > > > > > > > > >> > > > > > > > > > > > > > let's say 0 of
> > > RemoteLogSizeBytes,
> > > > > > will be
> > > > > > > > > > >> > surprised.
> > > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > Otherwise, LGTM
> > > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > Thank you for the KIP
> > > > > > > > > > >> > > > > > > > > > > > > > Luke
> > > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > On Thu, Nov 10, 2022 at 2:55
> > AM
> > > Jun
> > > > > > Rao
> > > > > > > > > > >> > > > > > > > <jun@confluent.io.invalid
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > wrote:
> > > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > Hi, Divij,
> > > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > Thanks for the
> explanation.
> > > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > 1. Hmm, the default
> > > implementation
> > > > > > of
> > > > > > > > RLMM
> > > > > > > > > > >> does
> > > > > > > > > > >> > > local
> > > > > > > > > > >> > > > > > > > caching,
> > > > > > > > > > >> > > > > > > > > > > right?
> > > > > > > > > > >> > > > > > > > > > > > > > > Currently, we also cache
> all
> > > > > segment
> > > > > > > > > > metadata
> > > > > > > > > > >> in
> > > > > > > > > > >> > > the
> > > > > > > > > > >> > > > > > > brokers
> > > > > > > > > > >> > > > > > > > > > > without
> > > > > > > > > > >> > > > > > > > > > > > > > > KIP-405. Do you see a need
> > to
> > > > > change
> > > > > > > > that?
> > > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > 2,3,4: Yes, your
> explanation
> > > makes
> > > > > > > > sense.
> > > > > > > > > > >> > However,
> > > > > > > > > > >> > > > > > > > > > > > > > > currently,
> > > > > > > > > > >> > > > > >
> > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > > > >> > > > > > > > > doesn't
> > > > > > > > > > >> > > > > > > > > > > > > > specify
> > > > > > > > > > >> > > > > > > > > > > > > > > a particular order of the
> > > > > iterator.
> > > > > > Do
> > > > > > > > you
> > > > > > > > > > >> intend
> > > > > > > > > > >> > > to
> > > > > > > > > > >> > > > > > change
> > > > > > > > > > >> > > > > > > > > that?
> > > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > Jun
> > > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at
> 3:31
> > AM
> > > > > Divij
> > > > > > > > > Vaidya
> > > > > > > > > > <
> > > > > > > > > > >> > > > > > > > > > > divijvaidya13@gmail.com
> > > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > Hey Jun
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > Thank you for your
> > comments.
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > *1. "RLMM implementor
> > could
> > > > > ensure
> > > > > > > > that
> > > > > > > > > > >> > > > > > > > > listRemoteLogSegments()
> > > > > > > > > > >> > > > > > > > > > > is
> > > > > > > > > > >> > > > > > > > > > > > > > fast"*
> > > > > > > > > > >> > > > > > > > > > > > > > > > This would be ideal but
> > > > > > pragmatically,
> > > > > > > > > it
> > > > > > > > > > is
> > > > > > > > > > >> > > > > difficult
> > > > > > > > > > >> > > > > > to
> > > > > > > > > > >> > > > > > > > > ensure
> > > > > > > > > > >> > > > > > > > > > > > that
> > > > > > > > > > >> > > > > > > > > > > > > > > > listRemoteLogSegments()
> is
> > > fast.
> > > > > > This
> > > > > > > > is
> > > > > > > > > > >> > because
> > > > > > > > > > >> > > of
> > > > > > > > > > >> > > > > the
> > > > > > > > > > >> > > > > > > > > > > possibility
> > > > > > > > > > >> > > > > > > > > > > > > of
> > > > > > > > > > >> > > > > > > > > > > > > > a
> > > > > > > > > > >> > > > > > > > > > > > > > > > large number of segments
> > > (much
> > > > > > larger
> > > > > > > > > than
> > > > > > > > > > >> what
> > > > > > > > > > >> > > > Kafka
> > > > > > > > > > >> > > > > > > > > currently
> > > > > > > > > > >> > > > > > > > > > > > > handles
> > > > > > > > > > >> > > > > > > > > > > > > > > > with local storage
> today)
> > > would
> > > > > > make
> > > > > > > > it
> > > > > > > > > > >> > > infeasible
> > > > > > > > > > >> > > > to
> > > > > > > > > > >> > > > > > > adopt
> > > > > > > > > > >> > > > > > > > > > > > > strategies
> > > > > > > > > > >> > > > > > > > > > > > > > > such
> > > > > > > > > > >> > > > > > > > > > > > > > > > as local caching to
> > improve
> > > the
> > > > > > > > > > performance
> > > > > > > > > > >> of
> > > > > > > > > > >> > > > > > > > > > > > listRemoteLogSegments.
> > > > > > > > > > >> > > > > > > > > > > > > > > Apart
> > > > > > > > > > >> > > > > > > > > > > > > > > > from caching (which
> won't
> > > work
> > > > > > due to
> > > > > > > > > size
> > > > > > > > > > >> > > > > > limitations) I
> > > > > > > > > > >> > > > > > > > > can't
> > > > > > > > > > >> > > > > > > > > > > > think
> > > > > > > > > > >> > > > > > > > > > > > > > of
> > > > > > > > > > >> > > > > > > > > > > > > > > > other strategies which
> may
> > > > > > eliminate
> > > > > > > > the
> > > > > > > > > > >> need
> > > > > > > > > > >> > for
> > > > > > > > > > >> > > > IO
> > > > > > > > > > >> > > > > > > > > > > > > > > > operations proportional
> to
> > > the
> > > > > > number
> > > > > > > > of
> > > > > > > > > > >> total
> > > > > > > > > > >> > > > > > segments.
> > > > > > > > > > >> > > > > > > > > Please
> > > > > > > > > > >> > > > > > > > > > > > > advise
> > > > > > > > > > >> > > > > > > > > > > > > > if
> > > > > > > > > > >> > > > > > > > > > > > > > > > you have something in
> > mind.
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > 2.  "*If the size
> exceeds
> > > the
> > > > > > > > retention
> > > > > > > > > > >> size,
> > > > > > > > > > >> > we
> > > > > > > > > > >> > > > need
> > > > > > > > > > >> > > > > > to
> > > > > > > > > > >> > > > > > > > > > > determine
> > > > > > > > > > >> > > > > > > > > > > > > the
> > > > > > > > > > >> > > > > > > > > > > > > > > > subset of segments to
> > > delete to
> > > > > > bring
> > > > > > > > > the
> > > > > > > > > > >> size
> > > > > > > > > > >> > > > within
> > > > > > > > > > >> > > > > > the
> > > > > > > > > > >> > > > > > > > > > > retention
> > > > > > > > > > >> > > > > > > > > > > > > > > limit.
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > Do we need to call RemoteLogMetadataManager.listRemoteLogSegments() to
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > determine that?"*
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > Yes, we need to call listRemoteLogSegments() to determine which segments
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > should be deleted. But there is a difference with the use case we are
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > trying to optimize with this KIP. To determine the subset of segments
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > which would be deleted, we only read metadata for segments which would be
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > deleted via the listRemoteLogSegments(). But to determine the
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > totalLogSize, which is required every time retention logic based on size
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > executes, we read metadata of *all* the segments in remote storage.
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > Hence, the number of results returned by
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > *RemoteLogMetadataManager.listRemoteLogSegments()* is different when we
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > are calculating totalLogSize vs. when we are determining the subset of
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > segments to delete.
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > 3. *"Also, what about time-based retention? To make that efficient, do we
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > need to make some additional interface changes?"* No. Note that time
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > complexity to determine the segments for retention is different for time
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > based vs. size based. For time based, the time complexity is a function
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > of the number of segments which are "eligible for deletion" (since we
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > only read metadata for segments which would be deleted) whereas in size
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > based retention, the time complexity is a function of "all segments"
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > available in remote storage (metadata of all segments needs to be read to
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > calculate the total size). As you may observe, this KIP will bring the
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > time complexity for both time based retention & size based retention to
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > the same function.
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > 4. Also, please note that this new API introduced in this KIP also
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > enables us to provide a metric for total size of data stored in remote
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > storage. Without the API, calculation of this metric will become very
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > expensive with *listRemoteLogSegments().*
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > I understand that your motivation here is to avoid polluting the
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > interface with optimization specific APIs and I will agree with that
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > goal. But I believe that this new API proposed in the KIP brings in
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > significant improvement and there is no other work around available to
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > achieve the same performance.
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > Regards,
> > > > > > > > > > >> > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 12:12 AM Jun Rao <jun@confluent.io.invalid>
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > Hi, Divij,
> > > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > Thanks for the KIP. Sorry for the late reply.
> > > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > The motivation of the KIP is to improve the efficiency of size based
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > retention. I am not sure the proposed changes are enough. For example, if
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > the size exceeds the retention size, we need to determine the subset of
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > segments to delete to bring the size within the retention limit. Do we
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > need to call RemoteLogMetadataManager.listRemoteLogSegments() to
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > determine that? Also, what about time-based retention? To make that
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > efficient, do we need to make some additional interface changes?
> > > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > An alternative approach is for the RLMM implementor to make sure that
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > RemoteLogMetadataManager.listRemoteLogSegments() is fast (e.g., with
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > local caching). This way, we could keep the interface simple. Have we
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > considered that?
> > > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > Jun
> > > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > On Wed, Sep 28, 2022 at 6:28 AM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > Hey folks
> > > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > Does anyone else have any thoughts on this before I propose this for a
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > vote?
> > > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > --
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > On Mon, Sep 5, 2022 at 12:57 PM Satish Duggana <satish.duggana@gmail.com>
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > Thanks for the KIP Divij!
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > This is a nice improvement to avoid recalculation of size. Customized
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > RLMMs can implement the best possible approach by caching or maintaining
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > the size in an efficient way. But this is not a big concern for the
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > default topic based RLMM as mentioned in the KIP.
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > ~Satish.
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > On Wed, 13 Jul 2022 at 18:48, Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Thank you for your review Luke.
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Reg: would the new `RemoteLogSizeBytes` metric be a performance
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > overhead? Although we move the calculation to a separate API, we still
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > can't assume users will implement a light-weight method, right?
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > This metric would be logged using the information that is already being
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > calculated for handling remote retention logic, hence, no additional work
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > is required to calculate this metric. More specifically, whenever
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > RemoteLogManager calls getRemoteLogSize API, this metric would be
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > captured. This API call is made every time RemoteLogManager wants to
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > handle expired remote log segments (which should be periodic). Does that
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > address your concern?
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > On Tue, Jul 12, 2022 at 11:01 AM Luke Chen <showuon@gmail.com> wrote:
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Hi Divij,
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Thanks for the KIP!
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > I think it makes sense to delegate the responsibility of calculation to
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > the specific RemoteLogMetadataManager implementation.
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > But one thing I'm not quite sure about is whether the new
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > `RemoteLogSizeBytes` metric would be a performance overhead.
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Although we move the calculation to a separate API, we still can't assume
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > users will implement a light-weight method, right?
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Thank you.
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Luke
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > On Fri, Jul 1, 2022 at 5:47 PM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-852%3A+Optimize+calculation+of+size+for+log+in+remote+tier
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Hey folks
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > Please take a look at this KIP which proposes an extension to KIP-405.
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > This is my first KIP with Apache Kafka community so any feedback would be
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > highly appreciated.
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Cheers!
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > Sr. Software Engineer
> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Amazon

Re: [DISCUSS] KIP-852 Optimize calculation of size for log in remote tier

Posted by Jorge Esteban Quilcate Otoya <qu...@gmail.com>.
Thanks Divij.

I was confusing this with the metric tags used by clients, which are based
on topic and partition. Ideally the partition label could be at a DEBUG
recording level, but that's outside the scope of this KIP.

Looks good to me, thanks again!

Jorge.
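
For readers following the thread, the difference between summing segment
sizes through listRemoteLogSegments() and asking the metadata manager for the
size directly (the remoteLogSize API named in the quoted reply below) can be
sketched as follows. This is only a minimal illustration, assuming a
hypothetical in-memory store: SegmentMetadata, InMemoryMetadataStore and
sizeByFullScan are made-up names, and the signatures are simplified rather
than those of the real RemoteLogMetadataManager interface.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class RemoteLogSizeSketch {

    /** Hypothetical stand-in for remote segment metadata; only the size matters here. */
    static final class SegmentMetadata {
        final long baseOffset;
        final long sizeInBytes;

        SegmentMetadata(long baseOffset, long sizeInBytes) {
            this.baseOffset = baseOffset;
            this.sizeInBytes = sizeInBytes;
        }
    }

    /** Hypothetical in-memory metadata store; NOT the real RemoteLogMetadataManager. */
    static final class InMemoryMetadataStore {
        private final List<SegmentMetadata> segments = new ArrayList<>();
        private long totalSizeBytes = 0L; // running total, updated on every add/remove

        void addSegment(SegmentMetadata segment) {
            segments.add(segment);
            totalSizeBytes += segment.sizeInBytes;
        }

        void removeSegment(SegmentMetadata segment) {
            // Only adjust the total if the segment was actually tracked.
            if (segments.remove(segment)) {
                totalSizeBytes -= segment.sizeInBytes;
            }
        }

        /** Listing call: the caller has to walk every returned entry. */
        Iterator<SegmentMetadata> listRemoteLogSegments() {
            return Collections.unmodifiableList(segments).iterator();
        }

        /** Size call: answered from the running total, no per-segment reads. */
        long remoteLogSize() {
            return totalSizeBytes;
        }
    }

    /** Size computed the expensive way: one metadata read per remote segment. */
    static long sizeByFullScan(InMemoryMetadataStore store) {
        long total = 0L;
        Iterator<SegmentMetadata> it = store.listRemoteLogSegments();
        while (it.hasNext()) {
            total += it.next().sizeInBytes;
        }
        return total;
    }

    public static void main(String[] args) {
        InMemoryMetadataStore store = new InMemoryMetadataStore();
        SegmentMetadata first = new SegmentMetadata(0L, 100L);
        store.addSegment(first);
        store.addSegment(new SegmentMetadata(1000L, 250L));
        System.out.println(sizeByFullScan(store) + " " + store.remoteLogSize());

        store.removeSegment(first); // e.g. a segment deleted by retention
        System.out.println(sizeByFullScan(store) + " " + store.remoteLogSize());
    }
}
```

Both calls return the same number; the point of the sketch is only that the
full scan costs one metadata read per segment, while a store that maintains
the total (or can serve it from an index) answers in constant time.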

On Wed, 12 Jul 2023 at 15:55, Divij Vaidya <di...@gmail.com> wrote:

> Jorge,
> About API name: Good point. I have changed it to remoteLogSize instead of
> getRemoteLogSize
>
> About partition tag in the metric: We don't use partition tag across any of
> the RemoteStorage metrics and I would like to keep this metric aligned with
> the rest. I will change the metric though to type=BrokerTopicMetrics
> instead of type=RemoteLogManager, since this is topic level information and
> not specific to RemoteLogManager.
>
>
> Satish,
> Ah yes! Updated from "This would increase the broker start-up time." to
> "This would increase the bootstrap time for the remote storage thread pool
> before the first eligible segment is archived."
>
> --
> Divij Vaidya
>
>
>
> On Mon, Jul 3, 2023 at 2:07 PM Satish Duggana <sa...@gmail.com>
> wrote:
>
> > Thanks Divij for taking the feedback and updating the motivation
> > section in the KIP.
> >
> > One more comment on Alternative solution-3: the con is not valid, as
> > that will not affect the broker restart times, as discussed in the
> > earlier email in this thread. You may want to update that.
> >
> > ~Satish.
> >
> > > On Sun, 2 Jul 2023 at 01:03, Divij Vaidya <di...@gmail.com>
> > > wrote:
> > >
> > > Thank you folks for reviewing this KIP.
> > >
> > > > Satish, I have modified the motivation to make it more clear. Now it says,
> > > > "Since the main feature of tiered storage is storing a large amount of
> > > > data, we expect num_remote_segments to be large. A frequent linear scan
> > > > (i.e. listing all segment metadata) could be expensive/slower because of
> > > > the underlying storage used by RemoteLogMetadataManager. This slowness to
> > > > list all segment metadata could result in the loss of availability...."
> > >
> > > Jun, Kamal, Satish, if you don't have any further concerns, I would
> > > appreciate a vote for this KIP in the voting thread -
> > > https://lists.apache.org/thread/soz00990gvzodv7oyqj4ysvktrqy6xfk
> > >
> > > --
> > > Divij Vaidya
> > >
> > >
> > >
> > > On Sat, Jul 1, 2023 at 6:16 AM Kamal Chandraprakash <
> > > kamal.chandraprakash@gmail.com> wrote:
> > >
> > > > Hi Divij,
> > > >
> > > > Thanks for the explanation. LGTM.
> > > >
> > > > --
> > > > Kamal
> > > >
> > > > > On Sat, Jul 1, 2023 at 7:28 AM Satish Duggana <satish.duggana@gmail.com>
> > > > > wrote:
> > > >
> > > > > Hi Divij,
> > > > > > I am fine with having an API to compute the size as I mentioned in my
> > > > > > earlier reply in this mail thread. But I have the below comment for
> > > > > > the motivation for this KIP.
> > > > > >
> > > > > > As you discussed offline, the main issue here is that listing calls for
> > > > > > remote log segment metadata are slower because of the storage used for
> > > > > > RLMM. These can be avoided with this new API.
> > > > >
> > > > > Please add this in the motivation section as it is one of the main
> > > > > motivations for the KIP.
> > > > >
> > > > > Thanks,
> > > > > Satish.
> > > > >
> > > > > > On Sat, 1 Jul 2023 at 01:43, Jun Rao <ju...@confluent.io.invalid>
> > > > > > wrote:
> > > > > >
> > > > > > Hi, Divij,
> > > > > >
> > > > > > Sorry for the late reply.
> > > > > >
> > > > > > > Given your explanation, the new API sounds reasonable to me. Is that
> > > > > > > enough to build the external metadata layer for the remote segments or
> > > > > > > do you need some additional API changes?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > > On Fri, Jun 9, 2023 at 7:08 AM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > wrote:
> > > > > >
> > > > > > > Thank you for looking into this Kamal.
> > > > > > >
> > > > > > > > You are right in saying that a cold start (i.e. leadership failover
> > > > > > > > or broker startup) does not impact the broker startup duration. But
> > > > > > > > it does have the following impact:
> > > > > > > > 1. It leads to a burst of full-scan requests to RLMM in case multiple
> > > > > > > > leadership failovers occur at the same time. Even if the RLMM
> > > > > > > > implementation has the capability to serve the total size from an
> > > > > > > > index (and hence handle this burst), we wouldn't be able to use it
> > > > > > > > since the current API necessarily calls for a full scan.
> > > > > > > > 2. The archival (copying of data to tiered storage) process will
> > > > > > > > have a delayed start. The delayed start of archival could lead to a
> > > > > > > > local build-up of data, which may fill the disk.
> > > > > > > >
> > > > > > > > The disadvantage of adding this new API is that every provider will
> > > > > > > > have to implement it, agreed. But I believe that this tradeoff is
> > > > > > > > worthwhile since the default implementation could be the same as you
> > > > > > > > mentioned, i.e. keeping a cumulative in-memory count.
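
A minimal sketch of the trade-off discussed in the message above, assuming a simplified, hypothetical MetadataManager interface (it is not the real RemoteLogMetadataManager, and the names listSegmentSizes, remoteLogSize and IndexedManager are made up for illustration). The default method aggregates over a full listing, while a plugin whose store keeps a running total can override it and skip the scan.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RemoteSizeSketch {
    /** Hypothetical, simplified stand-in for an RLMM; not the real Kafka interface. */
    interface MetadataManager {
        /** Sizes (in bytes) of every remote segment of the partition: a full listing. */
        List<Long> listSegmentSizes(String partition);

        /** Default: aggregate over the full listing, O(number of segments). */
        default long remoteLogSize(String partition) {
            long total = 0L;
            for (long size : listSegmentSizes(partition)) {
                total += size;
            }
            return total;
        }
    }

    /** A plugin whose backing store keeps a running total, so no scan is needed. */
    static class IndexedManager implements MetadataManager {
        private final Map<String, List<Long>> segments = new HashMap<>();
        private final Map<String, Long> totals = new HashMap<>();
        int listCalls = 0; // counts full listings, to make the difference visible

        void addSegment(String partition, long sizeInBytes) {
            segments.computeIfAbsent(partition, p -> new ArrayList<>()).add(sizeInBytes);
            totals.merge(partition, sizeInBytes, Long::sum);
        }

        @Override
        public List<Long> listSegmentSizes(String partition) {
            listCalls++;
            return segments.getOrDefault(partition, new ArrayList<>());
        }

        @Override
        public long remoteLogSize(String partition) {
            // Served from the running total instead of a listing.
            return totals.getOrDefault(partition, 0L);
        }
    }

    public static void main(String[] args) {
        IndexedManager m = new IndexedManager();
        m.addSegment("t-0", 100L);
        m.addSegment("t-0", 250L);
        System.out.println(m.remoteLogSize("t-0") + " bytes, listings: " + m.listCalls);
    }
}
```

With a shape like this, providers that cannot do better simply inherit the scan, and only stores with an index need to override the size method.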
> > > > > > >
> > > > > > > --
> > > > > > > Divij Vaidya
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Sun, Jun 4, 2023 at 5:48 PM Kamal Chandraprakash <
> > > > > > > kamal.chandraprakash@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Divij,
> > > > > > > >
> > > > > > > > Thanks for the KIP! Sorry for the late reply.
> > > > > > > >
> > > > > > > > > Can you explain the rejected alternative-3?
> > > > > > > > > Store the cumulative size of remote tier log in-memory at
> > > > > > > > > RemoteLogManager
> > > > > > > > > "*Cons*: Every time a broker starts-up, it will scan through all
> > > > > > > > > the segments in the remote tier to initialise the in-memory value.
> > > > > > > > > This would increase the broker start-up time."
> > > > > > > > >
> > > > > > > > > Keeping the source of truth to determine the remote-log-size in the
> > > > > > > > > leader would be consistent across different implementations of the
> > > > > > > > > plugin. The concern posted in the KIP is that we are calculating the
> > > > > > > > > remote-log-size on each iteration of the cleaner thread (say 5
> > > > > > > > > mins). If we calculate only once during broker startup or during
> > > > > > > > > the leadership reassignment, do we still need the cache?
> > > > > > > > >
> > > > > > > > > The broker startup-time won't be affected by the remote log manager
> > > > > > > > > initialisation. The broker continues to accept new produce/fetch
> > > > > > > > > requests, while the RLM thread in the background can determine the
> > > > > > > > > remote-log-size once and start copying/deleting the segments.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Kamal
> > > > > > > >
> > > > > > > > On Thu, Jun 1, 2023 at 2:08 PM Divij Vaidya <
> > > > divijvaidya13@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Satish / Jun
> > > > > > > > >
> > > > > > > > > Do you have any thoughts on this?
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Divij Vaidya
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Feb 14, 2023 at 4:15 PM Divij Vaidya <
> > > > > divijvaidya13@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hey Jun
> > > > > > > > > >
> > > > > > > > > > > It has been a while since this KIP got some attention. While we
> > > > > > > > > > > wait for Satish to chime in here, perhaps I can answer your
> > > > > > > > > > > question.
> > > > > > > > > > >
> > > > > > > > > > > > Could you explain how you exposed the log size in your KIP-405
> > > > > > > > > > > > implementation?
> > > > > > > > > > >
> > > > > > > > > > > The APIs available in RLMM as per KIP-405 are
> > > > > > > > > > > addRemoteLogSegmentMetadata(), updateRemoteLogSegmentMetadata(),
> > > > > > > > > > > remoteLogSegmentMetadata(), highestOffsetForEpoch(),
> > > > > > > > > > > putRemotePartitionDeleteMetadata(), listRemoteLogSegments(),
> > > > > > > > > > > onPartitionLeadershipChanges() and onStopPartitions(). None of
> > > > > > > > > > > these APIs allow us to expose the log size; hence, the only
> > > > > > > > > > > option that remains is to list all segments using
> > > > > > > > > > > listRemoteLogSegments() and aggregate them every time we need to
> > > > > > > > > > > calculate the size. Based on our prior discussion, this requires
> > > > > > > > > > > reading all segment metadata, which won't work for non-local RLMM
> > > > > > > > > > > implementations. Satish's implementation also performs a full
> > > > > > > > > > > scan and calculates the aggregate. See:
> > > > > > > > > > > https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/remote/RemoteLogManager.scala#L619
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Does this answer your question?
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Divij Vaidya
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Tue, Dec 20, 2022 at 8:40 PM Jun Rao
> > > > <jun@confluent.io.invalid
> > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >> Hi, Divij,
> > > > > > > > > >>
> > > > > > > > > >> Thanks for the explanation.
> > > > > > > > > >>
> > > > > > > > > >> Good question.
> > > > > > > > > >>
> > > > > > > > > >> Hi, Satish,
> > > > > > > > > >>
> > > > > > > > > >> Could you explain how you exposed the log size in your
> > KIP-405
> > > > > > > > > >> implementation?
> > > > > > > > > >>
> > > > > > > > > >> Thanks,
> > > > > > > > > >>
> > > > > > > > > >> Jun
> > > > > > > > > >>
> > > > > > > > > >> On Tue, Dec 20, 2022 at 4:59 AM Divij Vaidya <
> > > > > > > divijvaidya13@gmail.com
> > > > > > > > >
> > > > > > > > > >> wrote:
> > > > > > > > > >>
> > > > > > > > > >> > Hey Jun
> > > > > > > > > >> >
> > > > > > > > > > >> > Yes, it is possible to maintain the log size in the cache (see
> > > > > > > > > > >> > rejected alternative #3 in the KIP) but I did not understand how it
> > > > > > > > > > >> > is possible to retrieve it without the new API. The log size could
> > > > > > > > > > >> > be calculated on startup by scanning through the segments (though I
> > > > > > > > > > >> > would disagree that this is the right approach, since scanning
> > > > > > > > > > >> > itself takes on the order of minutes and hence delays the start of
> > > > > > > > > > >> > the archive process) and incrementally maintained afterwards. Even
> > > > > > > > > > >> > then, we would need an API in RemoteLogMetadataManager so that RLM
> > > > > > > > > > >> > could fetch the cached size!
> > > > > > > > > >> >
> > > > > > > > > > >> > If we wish to cache the size without adding a new API, then we need
> > > > > > > > > > >> > to cache the size in RLM itself (instead of the RLMM implementation)
> > > > > > > > > > >> > and incrementally manage it. The downside of longer archive time at
> > > > > > > > > > >> > startup (due to the initial scan) still remains valid in this
> > > > > > > > > > >> > situation.
> > > > > > > > > >> >
> > > > > > > > > >> > --
> > > > > > > > > >> > Divij Vaidya
> > > > > > > > > >> >
> > > > > > > > > >> >
> > > > > > > > > >> >
> > > > > > > > > >> > On Fri, Dec 16, 2022 at 12:43 AM Jun Rao
> > > > > <jun@confluent.io.invalid
> > > > > > > >
> > > > > > > > > >> wrote:
> > > > > > > > > >> >
> > > > > > > > > >> > > Hi, Divij,
> > > > > > > > > >> > >
> > > > > > > > > >> > > Thanks for the explanation.
> > > > > > > > > >> > >
> > > > > > > > > > >> > > If there is an in-memory cache, could we maintain the log size in
> > > > > > > > > > >> > > the cache with the existing API? For example, a replica could make a
> > > > > > > > > > >> > > listRemoteLogSegments(TopicIdPartition topicIdPartition) call on
> > > > > > > > > > >> > > startup to get the remote segment size before the current
> > > > > > > > > > >> > > leaderEpoch. The leader could then maintain the size incrementally
> > > > > > > > > > >> > > afterwards. On leader change, other replicas can make a
> > > > > > > > > > >> > > listRemoteLogSegments(TopicIdPartition topicIdPartition, int
> > > > > > > > > > >> > > leaderEpoch) call to get the size of newly generated segments.
> > > > > > > > > >> > >
> > > > > > > > > >> > > Thanks,
> > > > > > > > > >> > >
> > > > > > > > > >> > > Jun
> > > > > > > > > >> > >
> > > > > > > > > >> > >
> > > > > > > > > >> > > On Wed, Dec 14, 2022 at 3:27 AM Divij Vaidya <
> > > > > > > > > divijvaidya13@gmail.com
> > > > > > > > > >> >
> > > > > > > > > >> > > wrote:
> > > > > > > > > >> > >
> > > > > > > > > > >> > > > > Is the new method enough for doing size-based retention?
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > Yes. You are right in assuming that this API only provides the
> > > > > > > > > > >> > > > remote storage size (for the current epoch chain). We would use
> > > > > > > > > > >> > > > this API for size-based retention along with a value of
> > > > > > > > > > >> > > > localOnlyLogSegmentSize, which is computed as
> > > > > > > > > > >> > > > Log.sizeInBytes(logSegments.filter(_.baseOffset > highestOffsetWithRemoteIndex)).
> > > > > > > > > > >> > > > Hence, (total_log_size = remoteLogSizeBytes +
> > > > > > > > > > >> > > > log.localOnlyLogSegmentSize). I have updated the KIP with this
> > > > > > > > > > >> > > > information. You can also check an example implementation at
> > > > > > > > > > >> > > > https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/Log.scala#L2077
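
The size arithmetic in the message above can be sketched as follows. This is an illustration only: LocalSegment is a hypothetical stand-in for a local log segment, and the method names mirror the quantities named in the message rather than any real Kafka API. Local segments whose base offset is at or below highestOffsetWithRemoteIndex are skipped, so data already present in remote storage is not counted twice.

```java
import java.util.Arrays;
import java.util.List;

class TotalLogSizeSketch {
    /** Hypothetical stand-in for a local log segment. */
    static final class LocalSegment {
        final long baseOffset;
        final long sizeInBytes;

        LocalSegment(long baseOffset, long sizeInBytes) {
            this.baseOffset = baseOffset;
            this.sizeInBytes = sizeInBytes;
        }
    }

    /** Sum of local segments that start after the highest offset already in remote storage. */
    static long localOnlyLogSegmentSize(List<LocalSegment> segments,
                                        long highestOffsetWithRemoteIndex) {
        long total = 0L;
        for (LocalSegment segment : segments) {
            if (segment.baseOffset > highestOffsetWithRemoteIndex) {
                total += segment.sizeInBytes;
            }
        }
        return total;
    }

    /** total_log_size = remoteLogSizeBytes + localOnlyLogSegmentSize */
    static long totalLogSize(long remoteLogSizeBytes,
                             List<LocalSegment> segments,
                             long highestOffsetWithRemoteIndex) {
        return remoteLogSizeBytes
                + localOnlyLogSegmentSize(segments, highestOffsetWithRemoteIndex);
    }

    public static void main(String[] args) {
        List<LocalSegment> local = Arrays.asList(
                new LocalSegment(0L, 100L),    // also in remote storage
                new LocalSegment(100L, 100L),  // also in remote storage
                new LocalSegment(200L, 50L));  // local only
        // Remote storage holds offsets up to 199, i.e. 200 bytes in this example.
        System.out.println(totalLogSize(200L, local, 199L)); // prints 250
    }
}
```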
> > > > > > > > > >> > > >
> > > > > > > > > >> > > >
> > > > > > > > > > >> > > > > Do you imagine all accesses to remote metadata will be across the
> > > > > > > > > > >> > > > > network or will there be some local in-memory cache?
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > I would expect a disk-less implementation to maintain a finite
> > > > > > > > > > >> > > > in-memory cache for segment metadata to optimize the number of
> > > > > > > > > > >> > > > network calls made to fetch the data. In future, we can think about
> > > > > > > > > > >> > > > bringing this finite size cache into RLM itself but that's probably
> > > > > > > > > > >> > > > a conversation for a different KIP. There are many other things we
> > > > > > > > > > >> > > > would like to do to optimize the Tiered storage interface such as
> > > > > > > > > > >> > > > introducing a circular buffer / streaming interface from RSM (so
> > > > > > > > > > >> > > > that we don't have to wait to fetch the entire segment before
> > > > > > > > > > >> > > > starting to send records to the consumer), caching the segments
> > > > > > > > > > >> > > > fetched from RSM locally (I would assume all RSM plugin
> > > > > > > > > > >> > > > implementations to do this, might as well add it to RLM) etc.
> > > > > > > > > >> > > > --
> > > > > > > > > >> > > > Divij Vaidya
> > > > > > > > > >> > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > On Mon, Dec 12, 2022 at 7:35 PM Jun Rao
> > > > > > > > <jun@confluent.io.invalid
> > > > > > > > > >
> > > > > > > > > >> > > wrote:
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > Hi, Divij,
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > Thanks for the reply.
> > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > Is the new method enough for doing size-based retention? It gives
> > > > > > > > > > >> > > > > the total size of the remote segments, but it seems that we still
> > > > > > > > > > >> > > > > don't know the exact total size for a log since there could be
> > > > > > > > > > >> > > > > overlapping segments between the remote and the local segments.
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > You mentioned a disk-less implementation. Do you imagine all
> > > > > > > > > > >> > > > > accesses to remote metadata will be across the network or will
> > > > > > > > > > >> > > > > there be some local in-memory cache?
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > Thanks,
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > Jun
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > On Wed, Dec 7, 2022 at 3:10 AM Divij Vaidya <
> > > > > > > > > >> divijvaidya13@gmail.com
> > > > > > > > > >> > >
> > > > > > > > > >> > > > > wrote:
> > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > > The method is needed for RLMM implementations which fetch the
> > > > > > > > > > >> > > > > > information over the network and not for the disk-based
> > > > > > > > > > >> > > > > > implementations (such as the default topic-based RLMM).
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > I would argue that adding this API makes the interface more
> > > > > > > > > > >> > > > > > generic than what it is today. This is because, with the current
> > > > > > > > > > >> > > > > > APIs, an implementor is restricted to using disk-based RLMM
> > > > > > > > > > >> > > > > > solutions only (i.e. the default solution), whereas if we add this
> > > > > > > > > > >> > > > > > new API, we unblock usage of network-based RLMM implementations
> > > > > > > > > > >> > > > > > such as databases.
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > On Wed 30. Nov 2022 at 20:40, Jun Rao
> > > > > > > > > <jun@confluent.io.invalid
> > > > > > > > > >> >
> > > > > > > > > >> > > > wrote:
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > > Hi, Divij,
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > Thanks for the reply.
> > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > Point#2. My high-level question is whether the new method is
> > > > > > > > > > >> > > > > > > needed for every implementation of remote storage or just for a
> > > > > > > > > > >> > > > > > > specific implementation. The issues that you pointed out exist
> > > > > > > > > > >> > > > > > > for the default implementation of RLMM as well and so far, the
> > > > > > > > > > >> > > > > > > default implementation hasn't found a need for a similar new
> > > > > > > > > > >> > > > > > > method. For a public interface, ideally we want to make it more
> > > > > > > > > > >> > > > > > > general.
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > Thanks,
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > Jun
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > On Mon, Nov 21, 2022 at 7:11 AM Divij
> Vaidya <
> > > > > > > > > >> > > > divijvaidya13@gmail.com>
> > > > > > > > > >> > > > > > > wrote:
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > > Thank you Jun and Alex for your comments.
> > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > Point#1: You are right Jun. As Alex mentioned, the "derived
> > > > > > > > > > >> > > > > > > > metadata" can increase the size of cached metadata by a factor
> > > > > > > > > > >> > > > > > > > of 10 but it should be ok to cache just the actual metadata. My
> > > > > > > > > > >> > > > > > > > point about size being a limitation for using cache is not
> > > > > > > > > > >> > > > > > > > valid anymore.
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > Point#2: For a new replica, it would still have to fetch the
> > > > > > > > > > >> > > > > > > > metadata over the network to initiate the warm up of the cache
> > > > > > > > > > >> > > > > > > > and hence, increase the start time of the archival process.
> > > > > > > > > > >> > > > > > > > Please also note the repercussions of the warm up scan that
> > > > > > > > > > >> > > > > > > > Alex mentioned in this thread as part of #102.2.
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > 100#: Agreed Alex. Thanks for clarifying that. My point about
> > > > > > > > > > >> > > > > > > > size being a limitation for using cache is not valid anymore.
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > 101#: Alex, if I understand correctly, you are suggesting to
> > > > > > > > > > >> > > > > > > > cache the total size at the leader and update it on archival.
> > > > > > > > > > >> > > > > > > > This wouldn't work for cases when the leader restarts where we
> > > > > > > > > > >> > > > > > > > would have to make a full scan to update the total size entry
> > > > > > > > > > >> > > > > > > > on startup. We expect users to store data over longer duration
> > > > > > > > > > >> > > > > > > > in remote storage which increases the likelihood of leader
> > > > > > > > > > >> > > > > > > > restarts / failovers.
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > 102#.1: I don't think that the current design accommodates the
> > > > > > > > > > >> > > > > > > > fact that data corruption could happen at the RLMM plugin (we
> > > > > > > > > > >> > > > > > > > don't have checksum as a field in metadata as part of KIP405).
> > > > > > > > > > >> > > > > > > > If data corruption occurs, w/ or w/o the cache, it would be a
> > > > > > > > > > >> > > > > > > > different problem to solve. I would like to keep this outside
> > > > > > > > > > >> > > > > > > > the scope of this KIP.
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > 102#.2: Agree. This remains as the main concern for using the
> > > > > > > > > > >> > > > > > > > cache to fetch total size.
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > Regards,
> > > > > > > > > >> > > > > > > > Divij Vaidya
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > On Fri, Nov 18, 2022 at 12:59 PM Alexandre
> > > > > Dupriez <
> > > > > > > > > >> > > > > > > > alexandre.dupriez@gmail.com> wrote:
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > > Hi Divij,
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > Thanks for the KIP. Please find some comments based on what I
> > > > > > > > > > >> > > > > > > > > read on this thread so far - apologies for the repeats and the
> > > > > > > > > > >> > > > > > > > > late reply.
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > If I understand correctly, one of the main elements of
> > > > > > > > > > >> > > > > > > > > discussion is about caching in Kafka versus delegation of
> > > > > > > > > > >> > > > > > > > > providing the remote size of a topic-partition to the plugin.
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > A few comments:
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > 100. The size of the “derived metadata” which is managed by
> > > > > > > > > > >> > > > > > > > > the plugin to represent an rlmMetadata can indeed be close to
> > > > > > > > > > >> > > > > > > > > 1 kB on average depending on its own internal structure, e.g.
> > > > > > > > > > >> > > > > > > > > the redundancy it enforces (unfortunately resulting in
> > > > > > > > > > >> > > > > > > > > duplication), additional information such as checksums and
> > > > > > > > > > >> > > > > > > > > primary and secondary indexable keys. But indeed, the
> > > > > > > > > > >> > > > > > > > > rlmMetadata is itself a lighter data structure by a factor of
> > > > > > > > > > >> > > > > > > > > 10. And indeed, instead of caching the “derived metadata”,
> > > > > > > > > > >> > > > > > > > > only the rlmMetadata could be, which should address the
> > > > > > > > > > >> > > > > > > > > concern regarding the memory occupancy of the cache.
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > 101. I am not sure I fully understand why we would need to
> > > > > > > > > > >> > > > > > > > > cache the list of rlmMetadata to retain the remote size of a
> > > > > > > > > > >> > > > > > > > > topic-partition. Since the leader of a topic-partition is, in
> > > > > > > > > > >> > > > > > > > > non-degenerate cases, the only actor which can mutate the
> > > > > > > > > > >> > > > > > > > > remote part of the topic-partition, hence its size, it could
> > > > > > > > > > >> > > > > > > > > in theory only cache the size of the remote log once it has
> > > > > > > > > > >> > > > > > > > > calculated it? In which case there would not be any problem
> > > > > > > > > > >> > > > > > > > > regarding the size of the caching strategy. Did I miss
> > > > > > > > > > >> > > > > > > > > something there?
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > 102. There may be a few challenges to consider with caching:
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > 102.1) As mentioned above, the caching strategy assumes no
> > > > > > > > > > >> > > > > > > > > mutation outside the lifetime of a leader. While this is true
> > > > > > > > > > >> > > > > > > > > in the normal course of operation, there could be accidental
> > > > > > > > > > >> > > > > > > > > mutation outside of the leader and a loss of consistency
> > > > > > > > > > >> > > > > > > > > between the cached state and the actual remote representation
> > > > > > > > > > >> > > > > > > > > of the log. E.g. split-brain scenarios, bugs in the plugins,
> > > > > > > > > > >> > > > > > > > > bugs in external systems with mutating access on the derived
> > > > > > > > > > >> > > > > > > > > metadata. In the worst case, a drift between the cached size
> > > > > > > > > > >> > > > > > > > > and the actual size could lead to over-deleting remote data
> > > > > > > > > > >> > > > > > > > > which is a durability risk.
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > The alternative you propose, by making the plugin the source
> > > > > > > > > > >> > > > > > > > > of truth w.r.t. the size of the remote log, can make it easier
> > > > > > > > > > >> > > > > > > > > to avoid inconsistencies between plugin-managed metadata and
> > > > > > > > > > >> > > > > > > > > the remote log from the perspective of Kafka. On the other
> > > > > > > > > > >> > > > > > > > > hand, plugin vendors would have to implement it with the
> > > > > > > > > > >> > > > > > > > > expected efficiency to have it yield benefits.
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > 102.2) As you mentioned, the caching strategy in Kafka would
> > > > > > > > > > >> > > > > > > > > still require one iteration over the list of rlmMetadata when
> > > > > > > > > > >> > > > > > > > > the leadership of a topic-partition is assigned to a broker,
> > > > > > > > > > >> > > > > > > > > while the plugin can offer alternative constant-time
> > > > > > > > > > >> > > > > > > > > approaches. This calculation cannot be put on the LeaderAndIsr
> > > > > > > > > > >> > > > > > > > > path and would be performed in the background. In case of bulk
> > > > > > > > > > >> > > > > > > > > leadership migration, listing the rlmMetadata could a) result
> > > > > > > > > > >> > > > > > > > > in request bursts to any backend system the plugin may use
> > > > > > > > > > >> > > > > > > > > [which shouldn’t be a problem for high-throughput data stores
> > > > > > > > > > >> > > > > > > > > but could have cost implications] b) increase utilisation
> > > > > > > > > > >> > > > > > > > > timespan of the RLM threads for these calculations potentially
> > > > > > > > > > >> > > > > > > > > leading to transient starvation of tasks queued for,
> > > > > > > > > > >> > > > > > > > > typically, offloading operations c) have a non-marginal CPU
> > > > > > > > > > >> > > > > > > > > footprint on hardware with strict resource constraints. All
> > > > > > > > > > >> > > > > > > > > these elements could have an impact to some degree depending
> > > > > > > > > > >> > > > > > > > > on the operational environment.
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > From a design perspective, one question is where we want the
> > > > > > > > > > >> > > > > > > > > source of truth w.r.t. remote log size to be during the
> > > > > > > > > > >> > > > > > > > > lifetime of a leader. The responsibility of maintaining a
> > > > > > > > > > >> > > > > > > > > consistent representation of the remote log is shared by Kafka
> > > > > > > > > > >> > > > > > > > > and the plugin. Which system is best placed to maintain such a
> > > > > > > > > > >> > > > > > > > > state while providing the highest consistency guarantees is
> > > > > > > > > > >> > > > > > > > > something both Kafka and plugin designers could help
> > > > > > > > > > >> > > > > > > > > understand better.
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > Many thanks,
> > > > > > > > > >> > > > > > > > > Alexandre
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > On Thu, Nov 17, 2022 at 19:27, Jun Rao
> > > > > > > > > > >> > > > > > > > > <jun@confluent.io.invalid> wrote:
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > Hi, Divij,
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > Thanks for the reply.
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > Point #1. Is the average remote
> segment
> > > > > metadata
> > > > > > > > > really
> > > > > > > > > >> > 1KB?
> > > > > > > > > >> > > > > What's
> > > > > > > > > >> > > > > > > > > listed
> > > > > > > > > >> > > > > > > > > > in the public interface is probably
> well
> > > > > below 100
> > > > > > > > > >> bytes.
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > Point #2. I guess you are assuming
> that
> > each
> > > > > > > broker
> > > > > > > > > only
> > > > > > > > > >> > > caches
> > > > > > > > > >> > > > > the
> > > > > > > > > >> > > > > > > > > remote
> > > > > > > > > >> > > > > > > > > > segment metadata in memory. An
> > alternative
> > > > > > > approach
> > > > > > > > is
> > > > > > > > > >> to
> > > > > > > > > >> > > cache
> > > > > > > > > >> > > > > > them
> > > > > > > > > >> > > > > > > in
> > > > > > > > > >> > > > > > > > > > both memory and local disk. That way,
> on
> > > > > broker
> > > > > > > > > restart,
> > > > > > > > > >> > you
> > > > > > > > > >> > > > just
> > > > > > > > > >> > > > > > > need
> > > > > > > > > >> > > > > > > > to
> > > > > > > > > >> > > > > > > > > > fetch the new remote segments'
> metadata
> > > > using
> > > > > the
> > > > > > > > > >> > > > > > > > > > listRemoteLogSegments(TopicIdPartition
> > > > > > > > > topicIdPartition,
> > > > > > > > > >> > int
> > > > > > > > > >> > > > > > > > leaderEpoch)
> > > > > > > > > >> > > > > > > > > > api. Will that work?
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > Point #3. Thanks for the explanation
> > and it
> > > > > sounds
> > > > > > > > > good.
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > Thanks,
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > Jun
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > On Thu, Nov 17, 2022 at 7:31 AM Divij
> > > > Vaidya <
> > > > > > > > > >> > > > > > > divijvaidya13@gmail.com>
> > > > > > > > > >> > > > > > > > > > wrote:
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > Hi Jun
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > There are three points that I would
> > like
> > > > to
> > > > > > > > present
> > > > > > > > > >> here:
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > 1. We would require a large cache
> > size to
> > > > > > > > > efficiently
> > > > > > > > > >> > cache
> > > > > > > > > >> > > > all
> > > > > > > > > >> > > > > > > > segment
> > > > > > > > > >> > > > > > > > > > > metadata.
> > > > > > > > > >> > > > > > > > > > > 2. Linear scan of all metadata at
> > broker
> > > > > startup
> > > > > > > > to
> > > > > > > > > >> > > populate
> > > > > > > > > >> > > > > the
> > > > > > > > > >> > > > > > > > cache
> > > > > > > > > >> > > > > > > > > will
> > > > > > > > > >> > > > > > > > > > > be slow and will impact the archival
> > > > > process.
> > > > > > > > > >> > > > > > > > > > > 3. There is no other use case where
> a
> > full
> > > > > scan
> > > > > > > of
> > > > > > > > > >> > segment
> > > > > > > > > >> > > > > > metadata
> > > > > > > > > >> > > > > > > > is
> > > > > > > > > >> > > > > > > > > > > required.
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > Let's start by quantifying 1. Here's
> > my
> > > > > estimate
> > > > > > > > for
> > > > > > > > > >> the
> > > > > > > > > >> > > size
> > > > > > > > > >> > > > > of
> > > > > > > > > >> > > > > > > the
> > > > > > > > > >> > > > > > > > > cache.
> > > > > > > > > >> > > > > > > > > > > Average size of segment metadata =
> > 1KB.
> > > > This
> > > > > > > could
> > > > > > > > > be
> > > > > > > > > >> > more
> > > > > > > > > >> > > if
> > > > > > > > > >> > > > > we
> > > > > > > > > >> > > > > > > have
> > > > > > > > > >> > > > > > > > > > > frequent leader failover with a
> large
> > > > > number of
> > > > > > > > > leader
> > > > > > > > > >> > > epochs
> > > > > > > > > >> > > > > > being
> > > > > > > > > >> > > > > > > > > stored
> > > > > > > > > >> > > > > > > > > > > per segment.
> > > > > > > > > >> > > > > > > > > > > Segment size = 100MB. Users will
> > prefer to
> > > > > > > reduce
> > > > > > > > > the
> > > > > > > > > >> > > segment
> > > > > > > > > >> > > > > > size
> > > > > > > > > >> > > > > > > > > from the
> > > > > > > > > >> > > > > > > > > > > default value of 1GB to ensure
> timely
> > > > > archival
> > > > > > > of
> > > > > > > > > data
> > > > > > > > > >> > > since
> > > > > > > > > >> > > > > data
> > > > > > > > > >> > > > > > > > from
> > > > > > > > > >> > > > > > > > > > > active segment is not archived.
> > > > > > > > > >> > > > > > > > > > > Cache size = num segments * avg.
> > segment
> > > > > > > metadata
> > > > > > > > > >> size =
> > > > > > > > > >> > > > > > > > > (100TB/100MB)*1KB
> > > > > > > > > >> > > > > > > > > > > = 1GB.
> > > > > > > > > >> > > > > > > > > > > While 1GB for cache may not sound
> > like a
> > > > > large
> > > > > > > > > number
> > > > > > > > > >> for
> > > > > > > > > >> > > > > larger
> > > > > > > > > >> > > > > > > > > machines,
> > > > > > > > > >> > > > > > > > > > > it does eat into the memory as an
> > > > additional
> > > > > > > cache
> > > > > > > > > and
> > > > > > > > > >> > > makes
> > > > > > > > > >> > > > > use
> > > > > > > > > >> > > > > > > > cases
> > > > > > > > > >> > > > > > > > > with
> > > > > > > > > > >> > > > > > > > > > > large data retention with low throughput
> > > > > > > > > > >> > > > > > > > > > > expensive (where such use cases could
> > > > > > > > > > >> > > > > > > > > > > otherwise use smaller machines).
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > About point#2:
> > > > > > > > > >> > > > > > > > > > > Even if we say that all segment
> > metadata
> > > > > can fit
> > > > > > > > > into
> > > > > > > > > >> the
> > > > > > > > > >> > > > > cache,
> > > > > > > > > >> > > > > > we
> > > > > > > > > >> > > > > > > > > will
> > > > > > > > > >> > > > > > > > > > > need to populate the cache on broker
> > > > > startup. It
> > > > > > > > > would
> > > > > > > > > >> > not
> > > > > > > > > >> > > be
> > > > > > > > > >> > > > > in
> > > > > > > > > >> > > > > > > the
> > > > > > > > > > >> > > > > > > > > > > critical path of broker startup and
> > hence
> > > > > won't
> > > > > > > > > >> impact
> > > > > > > > > >> > the
> > > > > > > > > >> > > > > > startup
> > > > > > > > > >> > > > > > > > > time.
> > > > > > > > > >> > > > > > > > > > > But it will impact the time when we
> > could
> > > > > start
> > > > > > > > the
> > > > > > > > > >> > > archival
> > > > > > > > > >> > > > > > > process
> > > > > > > > > >> > > > > > > > > since
> > > > > > > > > >> > > > > > > > > > > the RLM thread pool will be blocked
> > on the
> > > > > first
> > > > > > > > > call
> > > > > > > > > >> to
> > > > > > > > > >> > > > > > > > > > > listRemoteLogSegments(). To scan
> > metadata
> > > > > for
> > > > > > > 1MM
> > > > > > > > > >> > segments
> > > > > > > > > >> > > > > > > (computed
> > > > > > > > > >> > > > > > > > > above)
> > > > > > > > > >> > > > > > > > > > > and transfer 1GB data over the
> network
> > > > from
> > > > > a
> > > > > > > RLMM
> > > > > > > > > >> such
> > > > > > > > > >> > as
> > > > > > > > > >> > > a
> > > > > > > > > >> > > > > > remote
> > > > > > > > > >> > > > > > > > > > > database would be in the order of
> > minutes
> > > > > > > > (depending
> > > > > > > > > >> on
> > > > > > > > > >> > how
> > > > > > > > > >> > > > > > > efficient
> > > > > > > > > >> > > > > > > > > the
> > > > > > > > > >> > > > > > > > > > > scan is with the RLMM
> implementation).
> > > > > > > Although, I
> > > > > > > > > >> would
> > > > > > > > > >> > > > > concede
> > > > > > > > > >> > > > > > > that
> > > > > > > > > >> > > > > > > > > > > having RLM threads blocked for a few
> > > > > minutes is
> > > > > > > > > >> perhaps
> > > > > > > > > >> > OK
> > > > > > > > > >> > > > but
> > > > > > > > > >> > > > > if
> > > > > > > > > >> > > > > > > we
> > > > > > > > > >> > > > > > > > > > > introduce the new API proposed in
> the
> > KIP,
> > > > > we
> > > > > > > > would
> > > > > > > > > >> have
> > > > > > > > > >> > a
> > > > > > > > > >> > > > > > > > > > > deterministic startup time for RLM.
> > Adding
> > > > > the
> > > > > > > API
> > > > > > > > > >> comes
> > > > > > > > > >> > > at a
> > > > > > > > > >> > > > > low
> > > > > > > > > >> > > > > > > > cost
> > > > > > > > > >> > > > > > > > > and
> > > > > > > > > >> > > > > > > > > > > I believe the trade off is worth it.
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > About point#3:
> > > > > > > > > >> > > > > > > > > > > We can use
> > > > > > > listRemoteLogSegments(TopicIdPartition
> > > > > > > > > >> > > > > > topicIdPartition,
> > > > > > > > > >> > > > > > > > int
> > > > > > > > > >> > > > > > > > > > > leaderEpoch) to calculate the
> segments
> > > > > eligible
> > > > > > > > for
> > > > > > > > > >> > > deletion
> > > > > > > > > >> > > > > > (based
> > > > > > > > > >> > > > > > > > on
> > > > > > > > > >> > > > > > > > > size
> > > > > > > > > >> > > > > > > > > > > retention) where leader epoch(s)
> > belong to
> > > > > the
> > > > > > > > > current
> > > > > > > > > >> > > leader
> > > > > > > > > >> > > > > > epoch
> > > > > > > > > >> > > > > > > > > chain.
> > > > > > > > > >> > > > > > > > > > > I understand that it may lead to
> > segments
> > > > > > > > belonging
> > > > > > > > > to
> > > > > > > > > >> > > other
> > > > > > > > > >> > > > > > epoch
> > > > > > > > > >> > > > > > > > > lineage
> > > > > > > > > >> > > > > > > > > > > not getting deleted and would
> require
> > a
> > > > > separate
> > > > > > > > > >> > mechanism
> > > > > > > > > >> > > to
> > > > > > > > > >> > > > > > > delete
> > > > > > > > > >> > > > > > > > > them.
> > > > > > > > > >> > > > > > > > > > > The separate mechanism would anyways
> > be
> > > > > required
> > > > > > > > to
> > > > > > > > > >> > delete
> > > > > > > > > >> > > > > these
> > > > > > > > > >> > > > > > > > > "leaked"
> > > > > > > > > >> > > > > > > > > > > segments as there are other cases
> > which
> > > > > could
> > > > > > > lead
> > > > > > > > > to
> > > > > > > > > >> > leaks
> > > > > > > > > >> > > > > such
> > > > > > > > > >> > > > > > as
> > > > > > > > > >> > > > > > > > > network
> > > > > > > > > > >> > > > > > > > > > > problems with RSM midway through writing a
> > > > > > > > > > >> > > > > > > > > > > segment, etc.
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > Thank you for the replies so far.
> They
> > > > have
> > > > > made
> > > > > > > > me
> > > > > > > > > >> > > re-think
> > > > > > > > > >> > > > my
> > > > > > > > > >> > > > > > > > > assumptions
> > > > > > > > > >> > > > > > > > > > > and this dialogue has been very
> > > > > constructive for
> > > > > > > > me.
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > Regards,
> > > > > > > > > >> > > > > > > > > > > Divij Vaidya
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > On Thu, Nov 10, 2022 at 10:49 PM Jun
> > Rao
> > > > > > > > > >> > > > > > <jun@confluent.io.invalid
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > > wrote:
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > Hi, Divij,
> > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > Thanks for the reply.
> > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > It's true that the data in Kafka
> > could
> > > > be
> > > > > kept
> > > > > > > > > >> longer
> > > > > > > > > >> > > with
> > > > > > > > > >> > > > > > > KIP-405.
> > > > > > > > > >> > > > > > > > > How
> > > > > > > > > >> > > > > > > > > > > > much data do you envision to have
> > per
> > > > > broker?
> > > > > > > > For
> > > > > > > > > >> 100TB
> > > > > > > > > >> > > > data
> > > > > > > > > >> > > > > > per
> > > > > > > > > >> > > > > > > > > broker,
> > > > > > > > > >> > > > > > > > > > > > with 1GB segment and segment
> > metadata of
> > > > > 100
> > > > > > > > > bytes,
> > > > > > > > > >> it
> > > > > > > > > >> > > > > requires
> > > > > > > > > >> > > > > > > > > > > > 100TB/1GB*100 = 10MB, which should
> > fit
> > > > in
> > > > > > > > memory.
> > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > RemoteLogMetadataManager has two
> > > > > > > > > >> > listRemoteLogSegments()
> > > > > > > > > >> > > > > > methods.
> > > > > > > > > >> > > > > > > > > The one
> > > > > > > > > >> > > > > > > > > > > > you listed
> > > > > > > > listRemoteLogSegments(TopicIdPartition
> > > > > > > > > >> > > > > > > topicIdPartition,
> > > > > > > > > >> > > > > > > > > int
> > > > > > > > > >> > > > > > > > > > > > leaderEpoch) does return data in
> > offset
> > > > > order.
> > > > > > > > > >> However,
> > > > > > > > > >> > > the
> > > > > > > > > >> > > > > > other
> > > > > > > > > >> > > > > > > > > > > > one
> > > > listRemoteLogSegments(TopicIdPartition
> > > > > > > > > >> > > > topicIdPartition)
> > > > > > > > > >> > > > > > > > doesn't
> > > > > > > > > >> > > > > > > > > > > > specify the return order. I assume
> > that
> > > > > you
> > > > > > > need
> > > > > > > > > the
> > > > > > > > > >> > > latter
> > > > > > > > > >> > > > > to
> > > > > > > > > >> > > > > > > > > calculate
> > > > > > > > > >> > > > > > > > > > > > the segment size?
> > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > Thanks,
> > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > Jun
> > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > On Thu, Nov 10, 2022 at 10:25 AM
> > Divij
> > > > > Vaidya
> > > > > > > <
> > > > > > > > > >> > > > > > > > > divijvaidya13@gmail.com>
> > > > > > > > > >> > > > > > > > > > > > wrote:
> > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > *Jun,*
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > *"the default implementation of
> > RLMM
> > > > > does
> > > > > > > > local
> > > > > > > > > >> > > caching,
> > > > > > > > > >> > > > > > > right?"*
> > > > > > > > > >> > > > > > > > > > > > > Yes, Jun. The default
> > implementation
> > > > of
> > > > > RLMM
> > > > > > > > > does
> > > > > > > > > >> > > indeed
> > > > > > > > > >> > > > > > cache
> > > > > > > > > >> > > > > > > > the
> > > > > > > > > >> > > > > > > > > > > > segment
> > > > > > > > > >> > > > > > > > > > > > > metadata today, hence, it won't
> > work
> > > > > for use
> > > > > > > > > cases
> > > > > > > > > >> > when
> > > > > > > > > >> > > > the
> > > > > > > > > >> > > > > > > > number
> > > > > > > > > >> > > > > > > > > of
> > > > > > > > > >> > > > > > > > > > > > > segments in remote storage is
> > large
> > > > > enough
> > > > > > > to
> > > > > > > > > >> exceed
> > > > > > > > > >> > > the
> > > > > > > > > >> > > > > size
> > > > > > > > > >> > > > > > > of
> > > > > > > > > >> > > > > > > > > cache.
> > > > > > > > > >> > > > > > > > > > > > As
> > > > > > > > > >> > > > > > > > > > > > > part of this KIP, I will
> > implement the
> > > > > new
> > > > > > > > > >> proposed
> > > > > > > > > >> > API
> > > > > > > > > >> > > > in
> > > > > > > > > >> > > > > > the
> > > > > > > > > >> > > > > > > > > default
> > > > > > > > > >> > > > > > > > > > > > > implementation of RLMM but the
> > > > > underlying
> > > > > > > > > >> > > implementation
> > > > > > > > > >> > > > > will
> > > > > > > > > >> > > > > > > > > still be
> > > > > > > > > >> > > > > > > > > > > a
> > > > > > > > > >> > > > > > > > > > > > > scan. I will pick up optimizing
> > that
> > > > in
> > > > > a
> > > > > > > > > separate
> > > > > > > > > >> > PR.
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > *"we also cache all segment
> > metadata
> > > > in
> > > > > the
> > > > > > > > > >> brokers
> > > > > > > > > >> > > > without
> > > > > > > > > >> > > > > > > > > KIP-405. Do
> > > > > > > > > >> > > > > > > > > > > > you
> > > > > > > > > >> > > > > > > > > > > > > see a need to change that?"*
> > > > > > > > > >> > > > > > > > > > > > > Please correct me if I am wrong
> > here
> > > > > but we
> > > > > > > > > cache
> > > > > > > > > >> > > > metadata
> > > > > > > > > >> > > > > > for
> > > > > > > > > >> > > > > > > > > segments
> > > > > > > > > >> > > > > > > > > > > > > "residing in local storage". The
> > size
> > > > > of the
> > > > > > > > > >> current
> > > > > > > > > >> > > > cache
> > > > > > > > > >> > > > > > > works
> > > > > > > > > >> > > > > > > > > fine
> > > > > > > > > >> > > > > > > > > > > for
> > > > > > > > > >> > > > > > > > > > > > > the scale of the number of
> > segments
> > > > > that we
> > > > > > > > > >> expect to
> > > > > > > > > >> > > > store
> > > > > > > > > >> > > > > > in
> > > > > > > > > >> > > > > > > > > local
> > > > > > > > > >> > > > > > > > > > > > > storage. After KIP-405, that
> cache
> > > > will
> > > > > > > > continue
> > > > > > > > > >> to
> > > > > > > > > >> > > store
> > > > > > > > > >> > > > > > > > metadata
> > > > > > > > > >> > > > > > > > > for
> > > > > > > > > >> > > > > > > > > > > > > segments which are residing in
> > local
> > > > > storage
> > > > > > > > and
> > > > > > > > > >> > hence,
> > > > > > > > > >> > > > we
> > > > > > > > > >> > > > > > > don't
> > > > > > > > > >> > > > > > > > > need
> > > > > > > > > >> > > > > > > > > > > to
> > > > > > > > > >> > > > > > > > > > > > > change that. For segments which
> > have
> > > > > been
> > > > > > > > > >> offloaded
> > > > > > > > > >> > to
> > > > > > > > > >> > > > > remote
> > > > > > > > > >> > > > > > > > > storage,
> > > > > > > > > >> > > > > > > > > > > it
> > > > > > > > > >> > > > > > > > > > > > > would rely on RLMM. Note that
> the
> > > > scale
> > > > > of
> > > > > > > > data
> > > > > > > > > >> > stored
> > > > > > > > > >> > > in
> > > > > > > > > >> > > > > > RLMM
> > > > > > > > > >> > > > > > > is
> > > > > > > > > >> > > > > > > > > > > > different
> > > > > > > > > >> > > > > > > > > > > > > from local cache because the
> > number of
> > > > > > > > segments
> > > > > > > > > is
> > > > > > > > > >> > > > expected
> > > > > > > > > >> > > > > > to
> > > > > > > > > >> > > > > > > be
> > > > > > > > > >> > > > > > > > > much
> > > > > > > > > >> > > > > > > > > > > > > larger than what current
> > > > implementation
> > > > > > > stores
> > > > > > > > > in
> > > > > > > > > >> > local
> > > > > > > > > >> > > > > > > storage.
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > 2,3,4:
> > > > > > > > > >> > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > > >> > > > > does
> > > > > > > > > >> > > > > > > > > specify
> > > > > > > > > >> > > > > > > > > > > the
> > > > > > > > > >> > > > > > > > > > > > > order i.e. it returns the
> segments
> > > > > sorted by
> > > > > > > > > first
> > > > > > > > > >> > > offset
> > > > > > > > > >> > > > > in
> > > > > > > > > >> > > > > > > > > ascending
> > > > > > > > > >> > > > > > > > > > > > > order. I am copying the API docs
> > for
> > > > > KIP-405
> > > > > > > > > here
> > > > > > > > > >> for
> > > > > > > > > >> > > > your
> > > > > > > > > >> > > > > > > > > reference
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > > > > *Returns iterator of remote log segment metadata, sorted by
> > > > > > > > > > >> > > > > > > > > > > > > {@link RemoteLogSegmentMetadata#startOffset()} in ascending order
> > > > > > > > > > >> > > > > > > > > > > > > which contains the given leader epoch. This is used by remote log
> > > > > > > > > > >> > > > > > > > > > > > > retention management subsystem to fetch the segment metadata for a
> > > > > > > > > > >> > > > > > > > > > > > > given leader epoch.
> > > > > > > > > > >> > > > > > > > > > > > > @param topicIdPartition topic partition
> > > > > > > > > > >> > > > > > > > > > > > > @param leaderEpoch leader epoch
> > > > > > > > > > >> > > > > > > > > > > > > @return Iterator of remote segments, sorted by start offset in
> > > > > > > > > > >> > > > > > > > > > > > > ascending order.*
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > *Luke,*
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > 5. Note that we are trying to
> > optimize
> > > > > the
> > > > > > > > > >> efficiency
> > > > > > > > > >> > > of
> > > > > > > > > >> > > > > size
> > > > > > > > > >> > > > > > > > based
> > > > > > > > > >> > > > > > > > > > > > > retention for remote storage.
> > KIP-405
> > > > > does
> > > > > > > not
> > > > > > > > > >> > > introduce
> > > > > > > > > >> > > > a
> > > > > > > > > >> > > > > > new
> > > > > > > > > >> > > > > > > > > config
> > > > > > > > > >> > > > > > > > > > > for
> > > > > > > > > >> > > > > > > > > > > > > periodically checking remote
> > similar
> > > > to
> > > > > > > > > >> > > > > > > > > > > log.retention.check.interval.ms
> > > > > > > > > >> > > > > > > > > > > > > which is applicable for remote
> > > > storage.
> > > > > > > Hence,
> > > > > > > > > the
> > > > > > > > > >> > > metric
> > > > > > > > > >> > > > > > will
> > > > > > > > > >> > > > > > > be
> > > > > > > > > >> > > > > > > > > > > updated
> > > > > > > > > >> > > > > > > > > > > > > at the time of invoking log
> > retention
> > > > > check
> > > > > > > > for
> > > > > > > > > >> > remote
> > > > > > > > > >> > > > tier
> > > > > > > > > >> > > > > > > which
> > > > > > > > > >> > > > > > > > > is
> > > > > > > > > >> > > > > > > > > > > > > pending implementation today. We
> > can
> > > > > perhaps
> > > > > > > > > come
> > > > > > > > > >> > back
> > > > > > > > > >> > > > and
> > > > > > > > > >> > > > > > > update
> > > > > > > > > >> > > > > > > > > the
> > > > > > > > > >> > > > > > > > > > > > > metric description after the
> > > > > implementation
> > > > > > > of
> > > > > > > > > log
> > > > > > > > > >> > > > > retention
> > > > > > > > > >> > > > > > > > check
> > > > > > > > > >> > > > > > > > > in
> > > > > > > > > >> > > > > > > > > > > > > RemoteLogManager.
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > --
> > > > > > > > > >> > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > On Thu, Nov 10, 2022 at 6:16 AM
> > Luke
> > > > > Chen <
> > > > > > > > > >> > > > > showuon@gmail.com
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > > > wrote:
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > Hi Divij,
> > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > One more question about the
> > metric:
> > > > > > > > > >> > > > > > > > > > > > > > I think the metric will be
> > updated
> > > > > when
> > > > > > > > > >> > > > > > > > > > > > > > (1) each time we run the log
> > > > retention
> > > > > > > check
> > > > > > > > > >> (that
> > > > > > > > > >> > > is,
> > > > > > > > > >> > > > > > > > > > > > > >
> log.retention.check.interval.ms
> > )
> > > > > > > > > > >> > > > > > > > > > > > > > (2) when the user explicitly calls
> > > > > > > > getRemoteLogSize
> > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > Is that correct?
> > > > > > > > > > >> > > > > > > > > > > > > > Maybe we should add a note in the metric description; otherwise, a
> > > > > > > > > > >> > > > > > > > > > > > > > user who gets, say, 0 for RemoteLogSizeBytes will be surprised.
> > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > Otherwise, LGTM
> > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > Thank you for the KIP
> > > > > > > > > >> > > > > > > > > > > > > > Luke
> > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > On Thu, Nov 10, 2022 at 2:55
> AM
> > Jun
> > > > > Rao
> > > > > > > > > >> > > > > > > > <jun@confluent.io.invalid
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > wrote:
> > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > Hi, Divij,
> > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > Thanks for the explanation.
> > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > 1. Hmm, the default
> > implementation
> > > > > of
> > > > > > > RLMM
> > > > > > > > > >> does
> > > > > > > > > >> > > local
> > > > > > > > > >> > > > > > > > caching,
> > > > > > > > > >> > > > > > > > > > > right?
> > > > > > > > > >> > > > > > > > > > > > > > > Currently, we also cache all
> > > > segment
> > > > > > > > > metadata
> > > > > > > > > >> in
> > > > > > > > > >> > > the
> > > > > > > > > >> > > > > > > brokers
> > > > > > > > > >> > > > > > > > > > > without
> > > > > > > > > >> > > > > > > > > > > > > > > KIP-405. Do you see a need
> to
> > > > change
> > > > > > > that?
> > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > 2,3,4: Yes, your explanation
> > makes
> > > > > > > sense.
> > > > > > > > > >> > However,
> > > > > > > > > >> > > > > > > > > > > > > > > currently,
> > > > > > > > > >> > > > > >
> RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > > >> > > > > > > > > doesn't
> > > > > > > > > >> > > > > > > > > > > > > > specify
> > > > > > > > > >> > > > > > > > > > > > > > > a particular order of the
> > > > iterator.
> > > > > Do
> > > > > > > you
> > > > > > > > > >> intend
> > > > > > > > > >> > > to
> > > > > > > > > >> > > > > > change
> > > > > > > > > >> > > > > > > > > that?
> > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > Jun
> > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 3:31
> AM
> > > > Divij
> > > > > > > > Vaidya
> > > > > > > > > <
> > > > > > > > > >> > > > > > > > > > > divijvaidya13@gmail.com
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > wrote:
> > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > Hey Jun
> > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > Thank you for your
> comments.
> > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > *1. "RLMM implementor
> could
> > > > ensure
> > > > > > > that
> > > > > > > > > >> > > > > > > > > listRemoteLogSegments()
> > > > > > > > > >> > > > > > > > > > > is
> > > > > > > > > >> > > > > > > > > > > > > > fast"*
> > > > > > > > > >> > > > > > > > > > > > > > > > This would be ideal but
> > > > > pragmatically,
> > > > > > > > it
> > > > > > > > > is
> > > > > > > > > >> > > > > difficult
> > > > > > > > > >> > > > > > to
> > > > > > > > > >> > > > > > > > > ensure
> > > > > > > > > >> > > > > > > > > > > > that
> > > > > > > > > > >> > > > > > > > > > > > > > > > listRemoteLogSegments() is fast. This is because the possibility of a
> > > > > > > > > >> > > > > > > > > > > > > > > > large number of segments
> > (much
> > > > > larger
> > > > > > > > than
> > > > > > > > > >> what
> > > > > > > > > >> > > > Kafka
> > > > > > > > > >> > > > > > > > > currently
> > > > > > > > > >> > > > > > > > > > > > > handles
> > > > > > > > > >> > > > > > > > > > > > > > > > with local storage today)
> > would
> > > > > make
> > > > > > > it
> > > > > > > > > >> > > infeasible
> > > > > > > > > >> > > > to
> > > > > > > > > >> > > > > > > adopt
> > > > > > > > > >> > > > > > > > > > > > > strategies
> > > > > > > > > >> > > > > > > > > > > > > > > such
> > > > > > > > > >> > > > > > > > > > > > > > > > as local caching to
> improve
> > the
> > > > > > > > > performance
> > > > > > > > > >> of
> > > > > > > > > >> > > > > > > > > > > > listRemoteLogSegments.
> > > > > > > > > >> > > > > > > > > > > > > > > Apart
> > > > > > > > > >> > > > > > > > > > > > > > > > from caching (which won't
> > work
> > > > > due to
> > > > > > > > size
> > > > > > > > > >> > > > > > limitations) I
> > > > > > > > > >> > > > > > > > > can't
> > > > > > > > > >> > > > > > > > > > > > think
> > > > > > > > > >> > > > > > > > > > > > > > of
> > > > > > > > > >> > > > > > > > > > > > > > > > other strategies which may
> > > > > eliminate
> > > > > > > the
> > > > > > > > > >> need
> > > > > > > > > >> > for
> > > > > > > > > >> > > > IO
> > > > > > > > > >> > > > > > > > > > > > > > > > operations proportional to
> > the
> > > > > number
> > > > > > > of
> > > > > > > > > >> total
> > > > > > > > > >> > > > > > segments.
> > > > > > > > > >> > > > > > > > > Please
> > > > > > > > > >> > > > > > > > > > > > > advise
> > > > > > > > > >> > > > > > > > > > > > > > if
> > > > > > > > > >> > > > > > > > > > > > > > > > you have something in
> mind.
> > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > 2.  "*If the size exceeds
> > the
> > > > > > > retention
> > > > > > > > > >> size,
> > > > > > > > > >> > we
> > > > > > > > > >> > > > need
> > > > > > > > > >> > > > > > to
> > > > > > > > > >> > > > > > > > > > > determine
> > > > > > > > > >> > > > > > > > > > > > > the
> > > > > > > > > >> > > > > > > > > > > > > > > > subset of segments to
> > delete to
> > > > > bring
> > > > > > > > the
> > > > > > > > > >> size
> > > > > > > > > >> > > > within
> > > > > > > > > >> > > > > > the
> > > > > > > > > >> > > > > > > > > > > retention
> > > > > > > > > >> > > > > > > > > > > > > > > limit.
> > > > > > > > > >> > > > > > > > > > > > > > > > Do we need to call
> > > > > > > > > >> > > > > > > > > > >
> > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > > >> > > > > > > > > > > > > to
> > > > > > > > > >> > > > > > > > > > > > > > > > determine that?"*
> > > > > > > > > >> > > > > > > > > > > > > > > > Yes, we need to call
> > > > > > > > > >> listRemoteLogSegments() to
> > > > > > > > > >> > > > > > determine
> > > > > > > > > >> > > > > > > > > which
> > > > > > > > > >> > > > > > > > > > > > > > segments
> > > > > > > > > >> > > > > > > > > > > > > > > > should be deleted. But
> > there is
> > > > a
> > > > > > > > > difference
> > > > > > > > > >> > with
> > > > > > > > > >> > > > the
> > > > > > > > > >> > > > > > use
> > > > > > > > > >> > > > > > > > > case we
> > > > > > > > > >> > > > > > > > > > > > are
> > > > > > > > > >> > > > > > > > > > > > > > > > trying to optimize with
> this
> > > > KIP.
> > > > > To
> > > > > > > > > >> determine
> > > > > > > > > >> > > the
> > > > > > > > > >> > > > > > subset
> > > > > > > > > >> > > > > > > > of
> > > > > > > > > >> > > > > > > > > > > > segments
> > > > > > > > > >> > > > > > > > > > > > > > > which
> > > > > > > > > >> > > > > > > > > > > > > > > > would be deleted, we only
> > read
> > > > > > > metadata
> > > > > > > > > for
> > > > > > > > > >> > > > segments
> > > > > > > > > >> > > > > > > which
> > > > > > > > > >> > > > > > > > > would
> > > > > > > > > >> > > > > > > > > > > be
> > > > > > > > > >> > > > > > > > > > > > > > > deleted
> > > > > > > > > >> > > > > > > > > > > > > > > > via the
> > listRemoteLogSegments().
> > > > > But
> > > > > > > to
> > > > > > > > > >> > determine
> > > > > > > > > >> > > > the
> > > > > > > > > >> > > > > > > > > > > totalLogSize,
> > > > > > > > > >> > > > > > > > > > > > > > which
> > > > > > > > > >> > > > > > > > > > > > > > > > is required every time
> > retention
> > > > > logic
> > > > > > > > > >> based on
> > > > > > > > > >> > > > size
> > > > > > > > > >> > > > > > > > > executes, we
> > > > > > > > > >> > > > > > > > > > > > > read
> > > > > > > > > >> > > > > > > > > > > > > > > > metadata of *all* the
> > segments
> > > > in
> > > > > > > remote
> > > > > > > > > >> > storage.
> > > > > > > > > >> > > > > > Hence,
> > > > > > > > > >> > > > > > > > the
> > > > > > > > > >> > > > > > > > > > > number
> > > > > > > > > >> > > > > > > > > > > > > of
> > > > > > > > > >> > > > > > > > > > > > > > > > results returned by
> > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > *RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > > >> > > > > > > > > > > > > > *is
> > > > > > > > > >> > > > > > > > > > > > > > > > different when we are
> > > > calculating
> > > > > > > > > >> totalLogSize
> > > > > > > > > >> > > vs.
> > > > > > > > > >> > > > > when
> > > > > > > > > >> > > > > > > we
> > > > > > > > > >> > > > > > > > > are
> > > > > > > > > >> > > > > > > > > > > > > > > determining
> > > > > > > > > >> > > > > > > > > > > > > > > > the subset of segments to
> > > > delete.
> > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > 3.
> > > > > > > > > >> > > > > > > > > > > > > > > > *"Also, what about
> > time-based
> > > > > > > retention?
> > > > > > > > > To
> > > > > > > > > >> > make
> > > > > > > > > >> > > > that
> > > > > > > > > >> > > > > > > > > efficient,
> > > > > > > > > >> > > > > > > > > > > do
> > > > > > > > > >> > > > > > > > > > > > > we
> > > > > > > > > >> > > > > > > > > > > > > > > need
> > > > > > > > > >> > > > > > > > > > > > > > > > to make some additional
> > > > interface
> > > > > > > > > >> changes?"*No.
> > > > > > > > > >> > > > Note
> > > > > > > > > >> > > > > > that
> > > > > > > > > >> > > > > > > > > time
> > > > > > > > > >> > > > > > > > > > > > > > complexity
> > > > > > > > > >> > > > > > > > > > > > > > > > to determine the segments
> > for
> > > > > > > retention
> > > > > > > > is
> > > > > > > > > >> > > > different
> > > > > > > > > >> > > > > > for
> > > > > > > > > >> > > > > > > > time
> > > > > > > > > >> > > > > > > > > > > based
> > > > > > > > > >> > > > > > > > > > > > > vs.
> > > > > > > > > >> > > > > > > > > > > > > > > > size based. For time
> based,
> > the
> > > > > time
> > > > > > > > > >> complexity
> > > > > > > > > >> > > is
> > > > > > > > > >> > > > a
> > > > > > > > > >> > > > > > > > > function of
> > > > > > > > > >> > > > > > > > > > > > the
> > > > > > > > > >> > > > > > > > > > > > > > > number
> > > > > > > > > >> > > > > > > > > > > > > > > > of segments which are
> > "eligible
> > > > > for
> > > > > > > > > >> deletion"
> > > > > > > > > >> > > > (since
> > > > > > > > > >> > > > > we
> > > > > > > > > >> > > > > > > > only
> > > > > > > > > >> > > > > > > > > read
> > > > > > > > > >> > > > > > > > > > > > > > > metadata
> > > > > > > > > >> > > > > > > > > > > > > > > > for segments which would
> be
> > > > > deleted)
> > > > > > > > > >> whereas in
> > > > > > > > > >> > > > size
> > > > > > > > > >> > > > > > > based
> > > > > > > > > >> > > > > > > > > > > > retention,
> > > > > > > > > >> > > > > > > > > > > > > > the
> > > > > > > > > >> > > > > > > > > > > > > > > > time complexity is a
> > function of
> > > > > "all
> > > > > > > > > >> segments"
> > > > > > > > > >> > > > > > available
> > > > > > > > > >> > > > > > > > in
> > > > > > > > > >> > > > > > > > > > > remote
> > > > > > > > > >> > > > > > > > > > > > > > > storage
> > > > > > > > > >> > > > > > > > > > > > > > > > (metadata of all segments
> > needs
> > > > > to be
> > > > > > > > read
> > > > > > > > > >> to
> > > > > > > > > >> > > > > calculate
> > > > > > > > > >> > > > > > > the
> > > > > > > > > >> > > > > > > > > total
> > > > > > > > > >> > > > > > > > > > > > > > size).
> > > > > > > > > >> > > > > > > > > > > > > > > As
> > > > > > > > > >> > > > > > > > > > > > > > > > you may observe, this KIP
> > will
> > > > > bring
> > > > > > > the
> > > > > > > > > >> time
> > > > > > > > > >> > > > > > complexity
> > > > > > > > > >> > > > > > > > for
> > > > > > > > > >> > > > > > > > > both
> > > > > > > > > >> > > > > > > > > > > > > time
> > > > > > > > > >> > > > > > > > > > > > > > > > based retention & size
> based
> > > > > retention
> > > > > > > > to
> > > > > > > > > >> the
> > > > > > > > > >> > > same
> > > > > > > > > >> > > > > > > > function.
> > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > 4. Also, please note that
> > this
> > > > > new API
> > > > > > > > > >> > introduced
> > > > > > > > > >> > > > in
> > > > > > > > > >> > > > > > this
> > > > > > > > > >> > > > > > > > KIP
> > > > > > > > > >> > > > > > > > > > > also
> > > > > > > > > >> > > > > > > > > > > > > > > enables
> > > > > > > > > >> > > > > > > > > > > > > > > > us to provide a metric for
> > total
> > > > > size
> > > > > > > of
> > > > > > > > > >> data
> > > > > > > > > >> > > > stored
> > > > > > > > > >> > > > > in
> > > > > > > > > >> > > > > > > > > remote
> > > > > > > > > >> > > > > > > > > > > > > storage.
> > > > > > > > > >> > > > > > > > > > > > > > > > Without the API,
> > calculation of
> > > > > this
> > > > > > > > > metric
> > > > > > > > > >> > will
> > > > > > > > > >> > > > > become
> > > > > > > > > >> > > > > > > > very
> > > > > > > > > >> > > > > > > > > > > > > expensive
> > > > > > > > > >> > > > > > > > > > > > > > > with
> > > > > > > > > >> > > > > > > > > > > > > > > > *listRemoteLogSegments().*
> > > > > > > > > >> > > > > > > > > > > > > > > > I understand that your
> > > > motivation
> > > > > here
> > > > > > > > is
> > > > > > > > > to
> > > > > > > > > >> > > avoid
> > > > > > > > > >> > > > > > > > polluting
> > > > > > > > > >> > > > > > > > > the
> > > > > > > > > >> > > > > > > > > > > > > > > interface
> > > > > > > > > >> > > > > > > > > > > > > > > > with optimization specific
> > APIs
> > > > > and I
> > > > > > > > will
> > > > > > > > > >> > agree
> > > > > > > > > >> > > > with
> > > > > > > > > >> > > > > > > that
> > > > > > > > > >> > > > > > > > > goal.
> > > > > > > > > >> > > > > > > > > > > > But
> > > > > > > > > >> > > > > > > > > > > > > I
> > > > > > > > > >> > > > > > > > > > > > > > > > believe that this new API
> > > > > proposed in
> > > > > > > > the
> > > > > > > > > >> KIP
> > > > > > > > > >> > > > brings
> > > > > > > > > >> > > > > in
> > > > > > > > > >> > > > > > > > > > > significant
> > > > > > > > > >> > > > > > > > > > > > > > > > improvement and there is
> no
> > > > other
> > > > > work
> > > > > > > > > >> around
> > > > > > > > > >> > > > > available
> > > > > > > > > >> > > > > > > to
> > > > > > > > > >> > > > > > > > > > > achieve
> > > > > > > > > >> > > > > > > > > > > > > the
> > > > > > > > > >> > > > > > > > > > > > > > > same
> > > > > > > > > >> > > > > > > > > > > > > > > > performance.
> > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > Regards,
> > > > > > > > > >> > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at
> > 12:12 AM
> > > > > Jun
> > > > > > > Rao
> > > > > > > > > >> > > > > > > > > <jun@confluent.io.invalid
> > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > wrote:
> > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > Hi, Divij,
> > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > Thanks for the KIP.
> Sorry
> > for
> > > > > the
> > > > > > > late
> > > > > > > > > >> reply.
> > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > The motivation of the
> KIP
> > is
> > > > to
> > > > > > > > improve
> > > > > > > > > >> the
> > > > > > > > > >> > > > > > efficiency
> > > > > > > > > >> > > > > > > of
> > > > > > > > > >> > > > > > > > > size
> > > > > > > > > >> > > > > > > > > > > > > based
> > > > > > > > > >> > > > > > > > > > > > > > > > > retention. I am not sure
> > the
> > > > > > > proposed
> > > > > > > > > >> changes
> > > > > > > > > >> > > are
> > > > > > > > > >> > > > > > > enough.
> > > > > > > > > >> > > > > > > > > For
> > > > > > > > > >> > > > > > > > > > > > > > example,
> > > > > > > > > >> > > > > > > > > > > > > > > if
> > > > > > > > > >> > > > > > > > > > > > > > > > > the size exceeds the
> > retention
> > > > > size,
> > > > > > > > we
> > > > > > > > > >> need
> > > > > > > > > >> > to
> > > > > > > > > >> > > > > > > determine
> > > > > > > > > >> > > > > > > > > the
> > > > > > > > > >> > > > > > > > > > > > > subset
> > > > > > > > > >> > > > > > > > > > > > > > of
> > > > > > > > > >> > > > > > > > > > > > > > > > > segments to delete to
> > bring
> > > > the
> > > > > size
> > > > > > > > > >> within
> > > > > > > > > >> > the
> > > > > > > > > >> > > > > > > retention
> > > > > > > > > >> > > > > > > > > > > limit.
> > > > > > > > > >> > > > > > > > > > > > Do
> > > > > > > > > >> > > > > > > > > > > > > > we
> > > > > > > > > >> > > > > > > > > > > > > > > > need
> > > > > > > > > >> > > > > > > > > > > > > > > > > to call
> > > > > > > > > >> > > > > > >
> > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > > >> > > > > > > > to
> > > > > > > > > >> > > > > > > > > > > > > determine
> > > > > > > > > >> > > > > > > > > > > > > > > > that?
> > > > > > > > > >> > > > > > > > > > > > > > > > > Also, what about
> > time-based
> > > > > > > retention?
> > > > > > > > > To
> > > > > > > > > >> > make
> > > > > > > > > >> > > > that
> > > > > > > > > >> > > > > > > > > efficient,
> > > > > > > > > >> > > > > > > > > > > do
> > > > > > > > > >> > > > > > > > > > > > > we
> > > > > > > > > >> > > > > > > > > > > > > > > need
> > > > > > > > > >> > > > > > > > > > > > > > > > > to make some additional
> > > > > interface
> > > > > > > > > changes?
> > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > An alternative approach
> > is for
> > > > > the
> > > > > > > > RLMM
> > > > > > > > > >> > > > implementor
> > > > > > > > > >> > > > > > to
> > > > > > > > > >> > > > > > > > make
> > > > > > > > > >> > > > > > > > > > > sure
> > > > > > > > > >> > > > > > > > > > > > > > > > > that
> > > > > > > > > >> > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > > >> > > > > > > is
> > > > > > > > > >> > > > > > > > > fast
> > > > > > > > > >> > > > > > > > > > > > > (e.g.,
> > > > > > > > > >> > > > > > > > > > > > > > > with
> > > > > > > > > >> > > > > > > > > > > > > > > > > local caching). This
> way,
> > we
> > > > > could
> > > > > > > > keep
> > > > > > > > > >> the
> > > > > > > > > >> > > > > interface
> > > > > > > > > >> > > > > > > > > simple.
> > > > > > > > > >> > > > > > > > > > > > Have
> > > > > > > > > >> > > > > > > > > > > > > we
> > > > > > > > > >> > > > > > > > > > > > > > > > > considered that?
> > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > Jun
> > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > On Wed, Sep 28, 2022 at
> > 6:28
> > > > AM
> > > > > > > Divij
> > > > > > > > > >> Vaidya
> > > > > > > > > >> > <
> > > > > > > > > >> > > > > > > > > > > > > > divijvaidya13@gmail.com>
> > > > > > > > > >> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > Hey folks
> > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > Does anyone else have
> > any
> > > > > thoughts
> > > > > > > > on
> > > > > > > > > >> this
> > > > > > > > > >> > > > > before I
> > > > > > > > > >> > > > > > > > > propose
> > > > > > > > > >> > > > > > > > > > > > this
> > > > > > > > > >> > > > > > > > > > > > > > for
> > > > > > > > > >> > > > > > > > > > > > > > > a
> > > > > > > > > >> > > > > > > > > > > > > > > > > > vote?
> > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > --
> > > > > > > > > >> > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > On Mon, Sep 5, 2022 at
> > 12:57
> > > > > PM
> > > > > > > > Satish
> > > > > > > > > >> > > Duggana
> > > > > > > > > >> > > > <
> > > > > > > > > >> > > > > > > > > > > > > > > > satish.duggana@gmail.com
> > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > Thanks for the KIP
> > Divij!
> > > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > This is a nice
> > improvement
> > > > > to
> > > > > > > > avoid
> > > > > > > > > >> > > > > recalculation
> > > > > > > > > >> > > > > > > of
> > > > > > > > > >> > > > > > > > > size.
> > > > > > > > > >> > > > > > > > > > > > > > > Customized
> > > > > > > > > >> > > > > > > > > > > > > > > > > > RLMMs
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > can implement the
> best
> > > > > possible
> > > > > > > > > >> approach
> > > > > > > > > >> > by
> > > > > > > > > >> > > > > > caching
> > > > > > > > > >> > > > > > > > or
> > > > > > > > > >> > > > > > > > > > > > > > maintaining
> > > > > > > > > >> > > > > > > > > > > > > > > > the
> > > > > > > > > >> > > > > > > > > > > > > > > > > > size
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > in an efficient way.
> > But
> > > > > this is
> > > > > > > > > not a
> > > > > > > > > >> > big
> > > > > > > > > >> > > > > > concern
> > > > > > > > > >> > > > > > > > for
> > > > > > > > > >> > > > > > > > > the
> > > > > > > > > >> > > > > > > > > > > > > > default
> > > > > > > > > >> > > > > > > > > > > > > > > > > topic
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > based RLMM as
> > mentioned in
> > > > > the
> > > > > > > > KIP.
> > > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > ~Satish.
> > > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > On Wed, 13 Jul 2022
> at
> > > > > 18:48,
> > > > > > > > Divij
> > > > > > > > > >> > Vaidya
> > > > > > > > > >> > > <
> > > > > > > > > >> > > > > > > > > > > > > > > divijvaidya13@gmail.com>
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > Thank you for your
> > > > review
> > > > > > > Luke.
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Reg: is that
> > would the
> > > > > new
> > > > > > > > > >> > > > > > `RemoteLogSizeBytes`
> > > > > > > > > >> > > > > > > > > metric
> > > > > > > > > >> > > > > > > > > > > > be a
> > > > > > > > > >> > > > > > > > > > > > > > > > > > performance
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > overhead? Although
> > we
> > > > > move the
> > > > > > > > > >> > > calculation
> > > > > > > > > >> > > > > to a
> > > > > > > > > >> > > > > > > > > seperate
> > > > > > > > > >> > > > > > > > > > > > API,
> > > > > > > > > >> > > > > > > > > > > > > > we
> > > > > > > > > >> > > > > > > > > > > > > > > > > still
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > can't assume users
> > will
> > > > > > > > implement
> > > > > > > > > a
> > > > > > > > > >> > > > > > light-weight
> > > > > > > > > >> > > > > > > > > method,
> > > > > > > > > >> > > > > > > > > > > > > right?
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > This metric would
> be
> > > > > logged
> > > > > > > > using
> > > > > > > > > >> the
> > > > > > > > > >> > > > > > information
> > > > > > > > > >> > > > > > > > > that is
> > > > > > > > > >> > > > > > > > > > > > > > already
> > > > > > > > > >> > > > > > > > > > > > > > > > > being
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > calculated for
> > handling
> > > > > remote
> > > > > > > > > >> > retention
> > > > > > > > > >> > > > > logic,
> > > > > > > > > >> > > > > > > > > hence, no
> > > > > > > > > >> > > > > > > > > > > > > > > > additional
> > > > > > > > > >> > > > > > > > > > > > > > > > > > work
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > is required to
> > calculate
> > > > > this
> > > > > > > > > >> metric.
> > > > > > > > > >> > > More
> > > > > > > > > >> > > > > > > > > specifically,
> > > > > > > > > >> > > > > > > > > > > > > > whenever
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > RemoteLogManager
> > calls
> > > > > > > > > >> getRemoteLogSize
> > > > > > > > > >> > > > API,
> > > > > > > > > >> > > > > > this
> > > > > > > > > >> > > > > > > > > metric
> > > > > > > > > >> > > > > > > > > > > > > would
> > > > > > > > > >> > > > > > > > > > > > > > be
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > captured.
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > This API call is
> > made
> > > > > every
> > > > > > > time
> > > > > > > > > >> > > > > > RemoteLogManager
> > > > > > > > > >> > > > > > > > > wants
> > > > > > > > > >> > > > > > > > > > > to
> > > > > > > > > >> > > > > > > > > > > > > > handle
> > > > > > > > > >> > > > > > > > > > > > > > > > > > expired
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > remote log
> segments
> > > > (which
> > > > > > > > should
> > > > > > > > > be
> > > > > > > > > >> > > > > periodic).
> > > > > > > > > >> > > > > > > > Does
> > > > > > > > > >> > > > > > > > > that
> > > > > > > > > >> > > > > > > > > > > > > > address
> > > > > > > > > >> > > > > > > > > > > > > > > > > your
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > concern?
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > On Tue, Jul 12,
> > 2022 at
> > > > > 11:01
> > > > > > > AM
> > > > > > > > > >> Luke
> > > > > > > > > >> > > Chen
> > > > > > > > > >> > > > <
> > > > > > > > > >> > > > > > > > > > > > > showuon@gmail.com>
> > > > > > > > > >> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Hi Divij,
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Thanks for the
> > KIP!
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > I think it makes
> > sense
> > > > > to
> > > > > > > > > delegate
> > > > > > > > > >> > the
> > > > > > > > > >> > > > > > > > > responsibility
> > > > > > > > > >> > > > > > > > > > > of
> > > > > > > > > >> > > > > > > > > > > > > > > > > calculation
> > > > > > > > > >> > > > > > > > > > > > > > > > > > to
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > specific
> > > > > > > > > RemoteLogMetadataManager
> > > > > > > > > >> > > > > > > implementation.
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > But one thing
> I'm
> > not
> > > > > quite
> > > > > > > > > sure,
> > > > > > > > > >> is
> > > > > > > > > >> > > that
> > > > > > > > > >> > > > > > would
> > > > > > > > > >> > > > > > > > > the new
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > `RemoteLogSizeBytes`
> > > > > metric
> > > > > > > > be a
> > > > > > > > > >> > > > > performance
> > > > > > > > > >> > > > > > > > > overhead?
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Although we move
> > the
> > > > > > > > calculation
> > > > > > > > > >> to a
> > > > > > > > > >> > > > > > seperate
> > > > > > > > > >> > > > > > > > > API, we
> > > > > > > > > >> > > > > > > > > > > > > still
> > > > > > > > > >> > > > > > > > > > > > > > > > can't
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > assume
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > users will
> > implement a
> > > > > > > > > >> light-weight
> > > > > > > > > >> > > > method,
> > > > > > > > > >> > > > > > > > right?
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Thank you.
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Luke
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > On Fri, Jul 1,
> > 2022 at
> > > > > 5:47
> > > > > > > PM
> > > > > > > > > >> Divij
> > > > > > > > > >> > > > > Vaidya <
> > > > > > > > > >> > > > > > > > > > > > > > > > > divijvaidya13@gmail.com
> > > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-852%3A+Optimize+calculation+of+size+for+log+in+remote+tier
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Hey folks
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Please take a
> > look
> > > > at
> > > > > this
> > > > > > > > KIP
> > > > > > > > > >> > which
> > > > > > > > > >> > > > > > proposes
> > > > > > > > > >> > > > > > > > an
> > > > > > > > > >> > > > > > > > > > > > > extension
> > > > > > > > > >> > > > > > > > > > > > > > to
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > KIP-405.
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > This
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > is my first
> KIP
> > with
> > > > > > > Apache
> > > > > > > > > >> Kafka
> > > > > > > > > >> > > > > community
> > > > > > > > > >> > > > > > > so
> > > > > > > > > >> > > > > > > > > any
> > > > > > > > > >> > > > > > > > > > > > > feedback
> > > > > > > > > >> > > > > > > > > > > > > > > > would
> > > > > > > > > >> > > > > > > > > > > > > > > > > > be
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > highly
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > appreciated.
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Cheers!
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Sr. Software
> > > > Engineer
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Amazon
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > >
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> >
>

Re: [DISCUSS] KIP-852 Optimize calculation of size for log in remote tier

Posted by Divij Vaidya <di...@gmail.com>.
Jorge,
About the API name: good point. I have renamed it from getRemoteLogSize to
remoteLogSize.
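
To make the difference concrete, here is a minimal Java sketch of the idea. It
is not the KIP's actual interface: SegmentMetadata, SketchMetadataManager and
IndexedMetadataManager are hypothetical, simplified stand-ins, and partitions
are identified by a plain String. The default method shows the full-scan
fallback (work proportional to the number of remote segments), while the
second class shows how an implementation could answer from a running total.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a remote segment's metadata.
final class SegmentMetadata {
    final long startOffset;
    final long sizeInBytes;

    SegmentMetadata(long startOffset, long sizeInBytes) {
        this.startOffset = startOffset;
        this.sizeInBytes = sizeInBytes;
    }
}

// Hypothetical, cut-down metadata manager; NOT the real KIP-405 interface.
interface SketchMetadataManager {
    Iterator<SegmentMetadata> listRemoteLogSegments(String topicPartition);

    // Fallback: aggregate over a full listing, O(number of remote segments).
    default long remoteLogSize(String topicPartition) {
        long total = 0L;
        Iterator<SegmentMetadata> it = listRemoteLogSegments(topicPartition);
        while (it.hasNext()) {
            total += it.next().sizeInBytes;
        }
        return total;
    }
}

// An implementation that maintains a per-partition running total, so the
// size query does not need to touch any segment metadata.
final class IndexedMetadataManager implements SketchMetadataManager {
    private final Map<String, List<SegmentMetadata>> segments = new HashMap<>();
    private final Map<String, Long> totalSizes = new HashMap<>();

    void addSegment(String topicPartition, SegmentMetadata segment) {
        segments.computeIfAbsent(topicPartition, k -> new ArrayList<>()).add(segment);
        totalSizes.merge(topicPartition, segment.sizeInBytes, Long::sum);
    }

    @Override
    public Iterator<SegmentMetadata> listRemoteLogSegments(String topicPartition) {
        return segments.getOrDefault(topicPartition, Collections.emptyList()).iterator();
    }

    @Override
    public long remoteLogSize(String topicPartition) {
        return totalSizes.getOrDefault(topicPartition, 0L);
    }
}

final class RemoteLogSizeDemo {
    public static void main(String[] args) {
        IndexedMetadataManager indexed = new IndexedMetadataManager();
        indexed.addSegment("topic-0", new SegmentMetadata(0L, 100L));
        indexed.addSegment("topic-0", new SegmentMetadata(50L, 250L));

        // Same listing, but the size is computed by the full-scan default.
        SketchMetadataManager scanning = indexed::listRemoteLogSegments;

        System.out.println("indexed:  " + indexed.remoteLogSize("topic-0"));
        System.out.println("scanning: " + scanning.remoteLogSize("topic-0"));
    }
}
```

Both paths return the same total; the only difference is how much metadata has
to be read to produce it, which is the cost this KIP is concerned with.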

About the partition tag in the metric: we don't use a partition tag on any of
the RemoteStorage metrics, and I would like to keep this metric aligned with
the rest. I will, however, change the metric to type=BrokerTopicMetrics
instead of type=RemoteLogManager, since this is topic-level information and
not specific to RemoteLogManager.


Satish,
Ah yes! Updated from "This would increase the broker start-up time." to
"This would increase the bootstrap time for the remote storage thread pool
before the first eligible segment is archived."

--
Divij Vaidya



On Mon, Jul 3, 2023 at 2:07 PM Satish Duggana <sa...@gmail.com>
wrote:

> Thanks Divij for taking the feedback and updating the motivation
> section in the KIP.
>
> One more comment on alternative solution 3: the con is not valid, as
> that will not affect broker restart times, as discussed in an
> earlier email in this thread. You may want to update that.
>
> ~Satish.
>
> On Sun, 2 Jul 2023 at 01:03, Divij Vaidya <di...@gmail.com> wrote:
> >
> > Thank you folks for reviewing this KIP.
> >
> > Satish, I have modified the motivation to make it clearer. It now says,
> > "Since the main feature of tiered storage is storing a large amount of
> > data, we expect num_remote_segments to be large. A frequent linear scan
> > (i.e. listing all segment metadata) could be expensive/slower because of
> > the underlying storage used by RemoteLogMetadataManager. This slowness to
> > list all segment metadata could result in the loss of availability...."
> >
> > Jun, Kamal, Satish, if you don't have any further concerns, I would
> > appreciate a vote for this KIP in the voting thread -
> > https://lists.apache.org/thread/soz00990gvzodv7oyqj4ysvktrqy6xfk
> >
> > --
> > Divij Vaidya
> >
> >
> >
> > On Sat, Jul 1, 2023 at 6:16 AM Kamal Chandraprakash <
> > kamal.chandraprakash@gmail.com> wrote:
> >
> > > Hi Divij,
> > >
> > > Thanks for the explanation. LGTM.
> > >
> > > --
> > > Kamal
> > >
> > > On Sat, Jul 1, 2023 at 7:28 AM Satish Duggana <
> satish.duggana@gmail.com>
> > > wrote:
> > >
> > > > Hi Divij,
> > > > I am fine with having an API to compute the size as I mentioned in my
> > > > earlier reply in this mail thread. But I have the below comment for
> > > > the motivation for this KIP.
> > > >
> > > > As you discussed offline, the main issue here is that listing calls for
> > > > remote log segment metadata are slower because of the storage used for
> > > > RLMM. These calls can be avoided with this new API.
> > > >
> > > > Please add this in the motivation section as it is one of the main
> > > > motivations for the KIP.
> > > >
> > > > Thanks,
> > > > Satish.
> > > >
> > > > On Sat, 1 Jul 2023 at 01:43, Jun Rao <ju...@confluent.io.invalid>
> wrote:
> > > > >
> > > > > Hi, Divij,
> > > > >
> > > > > Sorry for the late reply.
> > > > >
> > > > > > Given your explanation, the new API sounds reasonable to me. Is that
> > > > > > enough to build the external metadata layer for the remote segments, or
> > > > > > do you need some additional API changes?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Fri, Jun 9, 2023 at 7:08 AM Divij Vaidya <
> divijvaidya13@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Thank you for looking into this Kamal.
> > > > > >
> > > > > > > You are right in saying that a cold start (i.e. leadership failover or
> > > > > > > broker startup) does not impact the broker startup duration. But it does
> > > > > > > have the following impact:
> > > > > > > 1. It leads to a burst of full-scan requests to RLMM in case multiple
> > > > > > > leadership failovers occur at the same time. Even if the RLMM
> > > > > > > implementation has the capability to serve the total size from an index
> > > > > > > (and hence handle this burst), we wouldn't be able to use it since the
> > > > > > > current API necessarily calls for a full scan.
> > > > > > > 2. The archival (copying of data to tiered storage) process will have a
> > > > > > > delayed start. The delayed start of archival could lead to a local
> > > > > > > build-up of data, which may fill the disk.
> > > > > >
> > > > > > > The disadvantage of adding this new API is that every provider will have
> > > > > > > to implement it, agreed. But I believe that this tradeoff is worthwhile
> > > > > > > since the default implementation could be the same as you mentioned,
> > > > > > > i.e. keeping a cumulative in-memory count.
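
The "cumulative in-memory count" mentioned above could look roughly like the sketch below. This is only an illustration under stated assumptions: the class name, the method names and the use of a plain String as the partition key are hypothetical and are not taken from the KIP or from Kafka's code base.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical tracker that keeps a running total of remote bytes per partition.
 * Names and types are illustrative only; they are not Kafka's API.
 */
final class CumulativeRemoteSizeTracker {
    private final Map<String, AtomicLong> sizeByPartition = new ConcurrentHashMap<>();

    /** Called after a segment has been copied to the remote tier. */
    void onSegmentAdded(String partition, long segmentSizeBytes) {
        sizeByPartition.computeIfAbsent(partition, p -> new AtomicLong())
                       .addAndGet(segmentSizeBytes);
    }

    /** Called after a remote segment has been deleted. */
    void onSegmentDeleted(String partition, long segmentSizeBytes) {
        sizeByPartition.computeIfAbsent(partition, p -> new AtomicLong())
                       .addAndGet(-segmentSizeBytes);
    }

    /** Constant-time lookup; no scan over segment metadata is needed. */
    long remoteLogSize(String partition) {
        AtomicLong size = sizeByPartition.get(partition);
        return size == null ? 0L : size.get();
    }
}
```

Note that a counter like this starts empty after a restart, so it would still have to be seeded once (for example by a single scan), which is the cold-start cost discussed in this thread.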
> > > > > >
> > > > > > --
> > > > > > Divij Vaidya
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Sun, Jun 4, 2023 at 5:48 PM Kamal Chandraprakash <
> > > > > > kamal.chandraprakash@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Divij,
> > > > > > >
> > > > > > > Thanks for the KIP! Sorry for the late reply.
> > > > > > >
> > > > > > > > Can you explain the rejected alternative-3?
> > > > > > > > Store the cumulative size of remote tier log in-memory at
> > > > > > > > RemoteLogManager
> > > > > > > > "*Cons*: Every time a broker starts-up, it will scan through all the
> > > > > > > > segments in the remote tier to initialise the in-memory value. This
> > > > > > > > would increase the broker start-up time."
> > > > > > >
> > > > > > > > Keeping the source of truth to determine the remote-log-size in the
> > > > > > > > leader would be consistent across different implementations of the
> > > > > > > > plugin. The concern posted in the KIP is that we are calculating the
> > > > > > > > remote-log-size on each iteration of the cleaner thread (say 5 mins).
> > > > > > > > If we calculate only once during broker startup or during the
> > > > > > > > leadership reassignment, do we still need the cache?
> > > > > > >
> > > > > > > > The broker startup-time won't be affected by the remote log manager
> > > > > > > > initialisation. The broker can start accepting new produce/fetch
> > > > > > > > requests, while the RLM thread in the background can determine the
> > > > > > > > remote-log-size once and start copying/deleting the segments.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Kamal
> > > > > > >
> > > > > > > > On Thu, Jun 1, 2023 at 2:08 PM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > > wrote:
> > > > > > >
> > > > > > > > Satish / Jun
> > > > > > > >
> > > > > > > > Do you have any thoughts on this?
> > > > > > > >
> > > > > > > > --
> > > > > > > > Divij Vaidya
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Tue, Feb 14, 2023 at 4:15 PM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hey Jun
> > > > > > > > >
> > > > > > > > > > It has been a while since this KIP got some attention. While we
> > > > > > > > > > wait for Satish to chime in here, perhaps I can answer your question.
> > > > > > > > > >
> > > > > > > > > > > Could you explain how you exposed the log size in your KIP-405
> > > > > > > > > > > implementation?
> > > > > > > > >
> > > > > > > > > > The APIs available in RLMM as per KIP-405 are
> > > > > > > > > > addRemoteLogSegmentMetadata(), updateRemoteLogSegmentMetadata(),
> > > > > > > > > > remoteLogSegmentMetadata(), highestOffsetForEpoch(),
> > > > > > > > > > putRemotePartitionDeleteMetadata(), listRemoteLogSegments(),
> > > > > > > > > > onPartitionLeadershipChanges() and onStopPartitions(). None of these
> > > > > > > > > > APIs allows us to expose the log size; hence, the only option that
> > > > > > > > > > remains is to list all segments using listRemoteLogSegments() and
> > > > > > > > > > aggregate them every time we need to calculate the size. Based on our
> > > > > > > > > > prior discussion, this requires reading all segment metadata, which
> > > > > > > > > > won't work for non-local RLMM implementations. Satish's
> > > > > > > > > > implementation also performs a full scan and calculates the
> > > > > > > > > > aggregate. See:
> > > > > > > > > > https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/remote/RemoteLogManager.scala#L619
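
The list-and-aggregate approach described above amounts to something like the following sketch. It assumes a trimmed-down SegmentMetadata stand-in and an iterator shaped like the result of a listing call; both are hypothetical simplifications for illustration, not the linked implementation.

```java
import java.util.Iterator;

/** Hypothetical, trimmed-down view of one remote segment's metadata. */
final class SegmentMetadata {
    final long segmentSizeInBytes;

    SegmentMetadata(long segmentSizeInBytes) {
        this.segmentSizeInBytes = segmentSizeInBytes;
    }
}

final class FullScanSize {
    /**
     * Sums the size of every listed segment. The cost grows linearly with the
     * number of remote segments, and each call repeats the whole scan.
     */
    static long remoteLogSize(Iterator<SegmentMetadata> allSegments) {
        long total = 0L;
        while (allSegments.hasNext()) {
            total += allSegments.next().segmentSizeInBytes;
        }
        return total;
    }
}
```

With a metadata store that is reached over the network, every element the iterator yields may translate into remote reads, which is why the repeated scan is the concern here.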
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Does this answer your question?
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Divij Vaidya
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > On Tue, Dec 20, 2022 at 8:40 PM Jun Rao <jun@confluent.io.invalid>
> > > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > >> Hi, Divij,
> > > > > > > > >>
> > > > > > > > >> Thanks for the explanation.
> > > > > > > > >>
> > > > > > > > >> Good question.
> > > > > > > > >>
> > > > > > > > >> Hi, Satish,
> > > > > > > > >>
> > > > > > > > > >> Could you explain how you exposed the log size in your KIP-405
> > > > > > > > > >> implementation?
> > > > > > > > >>
> > > > > > > > >> Thanks,
> > > > > > > > >>
> > > > > > > > >> Jun
> > > > > > > > >>
> > > > > > > > > >> On Tue, Dec 20, 2022 at 4:59 AM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > > > >> wrote:
> > > > > > > > >>
> > > > > > > > >> > Hey Jun
> > > > > > > > >> >
> > > > > > > > > >> > Yes, it is possible to maintain the log size in the cache (see
> > > > > > > > > >> > rejected alternative#3 in the KIP) but I did not understand how it
> > > > > > > > > >> > is possible to retrieve it without the new API. The log size could
> > > > > > > > > >> > be calculated on startup by scanning through the segments (though I
> > > > > > > > > >> > would disagree that this is the right approach since scanning itself
> > > > > > > > > >> > takes on the order of minutes and hence delays the start of the
> > > > > > > > > >> > archive process), and incrementally maintained afterwards. Even
> > > > > > > > > >> > then, we would need an API in RemoteLogMetadataManager so that RLM
> > > > > > > > > >> > could fetch the cached size!
> > > > > > > > >> >
> > > > > > > > > >> > If we wish to cache the size without adding a new API, then we need
> > > > > > > > > >> > to cache the size in RLM itself (instead of the RLMM implementation)
> > > > > > > > > >> > and incrementally manage it. The downside of longer archive time at
> > > > > > > > > >> > startup (due to the initial scan) still remains valid in this
> > > > > > > > > >> > situation.
> > > > > > > > >> >
> > > > > > > > >> > --
> > > > > > > > >> > Divij Vaidya
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > > >> > On Fri, Dec 16, 2022 at 12:43 AM Jun Rao <jun@confluent.io.invalid>
> > > > > > > > > >> > wrote:
> > > > > > > > >> >
> > > > > > > > >> > > Hi, Divij,
> > > > > > > > >> > >
> > > > > > > > >> > > Thanks for the explanation.
> > > > > > > > >> > >
> > > > > > > > > >> > > If there is an in-memory cache, could we maintain the log size in
> > > > > > > > > >> > > the cache with the existing API? For example, a replica could make
> > > > > > > > > >> > > a listRemoteLogSegments(TopicIdPartition topicIdPartition) call on
> > > > > > > > > >> > > startup to get the remote segment size before the current
> > > > > > > > > >> > > leaderEpoch. The leader could then maintain the size incrementally
> > > > > > > > > >> > > afterwards. On leader change, other replicas can make a
> > > > > > > > > >> > > listRemoteLogSegments(TopicIdPartition topicIdPartition, int
> > > > > > > > > >> > > leaderEpoch) call to get the size of newly generated segments.
> > > > > > > > >> > >
> > > > > > > > >> > > Thanks,
> > > > > > > > >> > >
> > > > > > > > >> > > Jun
> > > > > > > > >> > >
> > > > > > > > >> > >
> > > > > > > > > >> > > On Wed, Dec 14, 2022 at 3:27 AM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > > > >> > > wrote:
> > > > > > > > >> > >
> > > > > > > > > >> > > > > Is the new method enough for doing size-based retention?
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > Yes. You are right in assuming that this API only provides the
> > > > > > > > > >> > > > remote storage size (for the current epoch chain). We would use
> > > > > > > > > >> > > > this API for size-based retention along with a value of
> > > > > > > > > >> > > > localOnlyLogSegmentSize, which is computed as
> > > > > > > > > >> > > > Log.sizeInBytes(logSegments.filter(_.baseOffset >
> > > > > > > > > >> > > > highestOffsetWithRemoteIndex)). Hence, (total_log_size =
> > > > > > > > > >> > > > remoteLogSizeBytes + log.localOnlyLogSegmentSize). I have updated
> > > > > > > > > >> > > > the KIP with this information. You can also check an example
> > > > > > > > > >> > > > implementation at
> > > > > > > > > >> > > > https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/Log.scala#L2077
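
The arithmetic above (total_log_size = remoteLogSizeBytes + localOnlyLogSegmentSize) can be sketched as follows. LocalSegment and the helper class are hypothetical stand-ins introduced only for this illustration; the filter mirrors the baseOffset > highestOffsetWithRemoteIndex condition quoted in the message.

```java
import java.util.List;

/** Hypothetical, minimal stand-in for a local log segment. */
final class LocalSegment {
    final long baseOffset;
    final long sizeInBytes;

    LocalSegment(long baseOffset, long sizeInBytes) {
        this.baseOffset = baseOffset;
        this.sizeInBytes = sizeInBytes;
    }
}

final class TotalLogSize {
    /** Size of local segments that start after the highest offset already in remote storage. */
    static long localOnlyLogSegmentSize(List<LocalSegment> localSegments,
                                        long highestOffsetWithRemoteIndex) {
        long total = 0L;
        for (LocalSegment segment : localSegments) {
            if (segment.baseOffset > highestOffsetWithRemoteIndex) {
                total += segment.sizeInBytes;
            }
        }
        return total;
    }

    /** Remote size plus local-only size, so overlapping segments are counted once. */
    static long totalLogSize(long remoteLogSizeBytes,
                             List<LocalSegment> localSegments,
                             long highestOffsetWithRemoteIndex) {
        return remoteLogSizeBytes
            + localOnlyLogSegmentSize(localSegments, highestOffsetWithRemoteIndex);
    }
}
```

Filtering on the base offset is what keeps segments that exist both locally and remotely from being counted twice.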
> > > > > > > > >> > > >
> > > > > > > > >> > > >
> > > > > > > > > >> > > > > Do you imagine all accesses to remote metadata will be across
> > > > > > > > > >> > > > > the network or will there be some local in-memory cache?
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > I would expect a disk-less implementation to maintain a finite
> > > > > > > > > >> > > > in-memory cache for segment metadata to optimize the number of
> > > > > > > > > >> > > > network calls made to fetch the data. In future, we can think
> > > > > > > > > >> > > > about bringing this finite size cache into RLM itself but that's
> > > > > > > > > >> > > > probably a conversation for a different KIP. There are many other
> > > > > > > > > >> > > > things we would like to do to optimize the Tiered storage
> > > > > > > > > >> > > > interface such as introducing a circular buffer / streaming
> > > > > > > > > >> > > > interface from RSM (so that we don't have to wait to fetch the
> > > > > > > > > >> > > > entire segment before starting to send records to the consumer),
> > > > > > > > > >> > > > caching the segments fetched from RSM locally (I would assume all
> > > > > > > > > >> > > > RSM plugin implementations to do this, might as well add it to
> > > > > > > > > >> > > > RLM) etc.
> > > > > > > > >> > > >
> > > > > > > > >> > > > --
> > > > > > > > >> > > > Divij Vaidya
> > > > > > > > >> > > >
> > > > > > > > >> > > >
> > > > > > > > >> > > >
> > > > > > > > > >> > > > On Mon, Dec 12, 2022 at 7:35 PM Jun Rao <jun@confluent.io.invalid>
> > > > > > > > > >> > > > wrote:
> > > > > > > > >> > > >
> > > > > > > > >> > > > > Hi, Divij,
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > Thanks for the reply.
> > > > > > > > >> > > > >
> > > > > > > > > >> > > > > Is the new method enough for doing size-based retention? It
> > > > > > > > > >> > > > > gives the total size of the remote segments, but it seems that
> > > > > > > > > >> > > > > we still don't know the exact total size for a log since there
> > > > > > > > > >> > > > > could be overlapping segments between the remote and the local
> > > > > > > > > >> > > > > segments.
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > You mentioned a disk-less implementation. Do you imagine all
> > > > > > > > > >> > > > > accesses to remote metadata will be across the network or will
> > > > > > > > > >> > > > > there be some local in-memory cache?
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > Thanks,
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > Jun
> > > > > > > > >> > > > >
> > > > > > > > >> > > > >
> > > > > > > > >> > > > >
> > > > > > > > > >> > > > > On Wed, Dec 7, 2022 at 3:10 AM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > > > >> > > > > wrote:
> > > > > > > > >> > > > >
> > > > > > > > > >> > > > > > The method is needed for RLMM implementations which fetch the
> > > > > > > > > >> > > > > > information over the network and not for the disk-based
> > > > > > > > > >> > > > > > implementations (such as the default topic-based RLMM).
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > I would argue that adding this API makes the interface more
> > > > > > > > > >> > > > > > generic than what it is today. This is because, with the
> > > > > > > > > >> > > > > > current APIs, an implementor is restricted to disk-based RLMM
> > > > > > > > > >> > > > > > solutions only (i.e. the default solution), whereas if we add
> > > > > > > > > >> > > > > > this new API, we unblock usage of network-based RLMM
> > > > > > > > > >> > > > > > implementations such as databases.
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > On Wed 30. Nov 2022 at 20:40, Jun Rao <jun@confluent.io.invalid>
> > > > > > > > > >> > > > > > wrote:
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > > Hi, Divij,
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > Thanks for the reply.
> > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > Point#2. My high level question is whether the new method is
> > > > > > > > > >> > > > > > > needed for every implementation of remote storage or just
> > > > > > > > > >> > > > > > > for a specific implementation. The issues that you pointed
> > > > > > > > > >> > > > > > > out exist for the default implementation of RLMM as well and
> > > > > > > > > >> > > > > > > so far, the default implementation hasn't found a need for a
> > > > > > > > > >> > > > > > > similar new method. For a public interface, ideally we want
> > > > > > > > > >> > > > > > > to make it more general.
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > Thanks,
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > Jun
> > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > On Mon, Nov 21, 2022 at 7:11 AM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > > > >> > > > > > > wrote:
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > > Thank you Jun and Alex for your comments.
> > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > Point#1: You are right Jun. As Alex mentioned, the "derived
> > > > > > > > > >> > > > > > > > metadata" can increase the size of cached metadata by a
> > > > > > > > > >> > > > > > > > factor of 10 but it should be ok to cache just the actual
> > > > > > > > > >> > > > > > > > metadata. My point about size being a limitation for using
> > > > > > > > > >> > > > > > > > cache is not valid anymore.
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > Point#2: For a new replica, it would still have to fetch
> > > > > > > > > >> > > > > > > > the metadata over the network to initiate the warm up of
> > > > > > > > > >> > > > > > > > the cache and hence, increase the start time of the
> > > > > > > > > >> > > > > > > > archival process. Please also note the repercussions of the
> > > > > > > > > >> > > > > > > > warm up scan that Alex mentioned in this thread as part of
> > > > > > > > > >> > > > > > > > #102.2.
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > 100#: Agreed Alex. Thanks for clarifying that. My point
> > > > > > > > > >> > > > > > > > about size being a limitation for using cache is not valid
> > > > > > > > > >> > > > > > > > anymore.
> > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > 101#: Alex, if I understand correctly, you are suggesting
> > > > > > > > > >> > > > > > > > to cache the total size at the leader and update it on
> > > > > > > > > >> > > > > > > > archival. This wouldn't work for cases when the leader
> > > > > > > > > >> > > > > > > > restarts, where we would have to make a full scan to update
> > > > > > > > > >> > > > > > > > the total size entry on startup. We expect users to store
> > > > > > > > > >> > > > > > > > data over a longer duration in remote storage, which
> > > > > > > > > >> > > > > > > > increases the likelihood of leader restarts / failovers.
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > 102#.1: I don't think that the current design accommodates
> > > > > > > > > >> > > > > > > > the fact that data corruption could happen at the RLMM
> > > > > > > > > >> > > > > > > > plugin (we don't have checksum as a field in metadata as
> > > > > > > > > >> > > > > > > > part of KIP405). If data corruption occurs, w/ or w/o the
> > > > > > > > > >> > > > > > > > cache, it would be a different problem to solve. I would
> > > > > > > > > >> > > > > > > > like to keep this outside the scope of this KIP.
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > 102#.2: Agree. This remains the main concern for using the
> > > > > > > > > >> > > > > > > > cache to fetch the total size.
> > > > > > > > >> > > > > > > >
> > > > > > > > >> > > > > > > > Regards,
> > > > > > > > >> > > > > > > > Divij Vaidya
> > > > > > > > >> > > > > > > >
> > > > > > > > >> > > > > > > >
> > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > On Fri, Nov 18, 2022 at 12:59 PM Alexandre Dupriez <alexandre.dupriez@gmail.com>
> > > > > > > > > >> > > > > > > > wrote:
> > > > > > > > >> > > > > > > >
> > > > > > > > >> > > > > > > > > Hi Divij,
> > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > Thanks for the KIP. Please find some comments based on
> > > > > > > > > >> > > > > > > > > what I read on this thread so far - apologies for the
> > > > > > > > > >> > > > > > > > > repeats and the late reply.
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > If I understand correctly, one of the main elements of
> > > > > > > > > >> > > > > > > > > discussion is about caching in Kafka versus delegation of
> > > > > > > > > >> > > > > > > > > providing the remote size of a topic-partition to the
> > > > > > > > > >> > > > > > > > > plugin.
> > > > > > > > >> > > > > > > > >
> > > > > > > > >> > > > > > > > > A few comments:
> > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > 100. The size of the “derived metadata” which is managed
> > > > > > > > > >> > > > > > > > > by the plugin to represent an rlmMetadata can indeed be
> > > > > > > > > >> > > > > > > > > close to 1 kB on average depending on its own internal
> > > > > > > > > >> > > > > > > > > structure, e.g. the redundancy it enforces (unfortunately
> > > > > > > > > >> > > > > > > > > resulting in duplication), additional information such as
> > > > > > > > > >> > > > > > > > > checksums and primary and secondary indexable keys. But
> > > > > > > > > >> > > > > > > > > indeed, the rlmMetadata is itself a lighter data
> > > > > > > > > >> > > > > > > > > structure by a factor of 10. And indeed, instead of
> > > > > > > > > >> > > > > > > > > caching the “derived metadata”, only the rlmMetadata
> > > > > > > > > >> > > > > > > > > could be cached, which should address the concern
> > > > > > > > > >> > > > > > > > > regarding the memory occupancy of the cache.
> > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > 101. I am not sure I fully understand why we would need
> > > > > > > > > >> > > > > > > > > to cache the list of rlmMetadata to retain the remote
> > > > > > > > > >> > > > > > > > > size of a topic-partition. Since the leader of a
> > > > > > > > > >> > > > > > > > > topic-partition is, in non-degenerate cases, the only
> > > > > > > > > >> > > > > > > > > actor which can mutate the remote part of the
> > > > > > > > > >> > > > > > > > > topic-partition, hence its size, it could in theory only
> > > > > > > > > >> > > > > > > > > cache the size of the remote log once it has calculated
> > > > > > > > > >> > > > > > > > > it? In which case there would not be any problem
> > > > > > > > > >> > > > > > > > > regarding the size of the caching strategy. Did I miss
> > > > > > > > > >> > > > > > > > > something there?
> > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > 102. There may be a few challenges to consider with
> > > > > > > > > >> > > > > > > > > caching:
> > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > 102.1) As mentioned above, the caching strategy assumes
> > > > > > > > > >> > > > > > > > > no mutation outside the lifetime of a leader. While this
> > > > > > > > > >> > > > > > > > > is true in the normal course of operation, there could be
> > > > > > > > > >> > > > > > > > > accidental mutation outside of the leader and a loss of
> > > > > > > > > >> > > > > > > > > consistency between the cached state and the actual
> > > > > > > > > >> > > > > > > > > remote representation of the log. E.g. split-brain
> > > > > > > > > >> > > > > > > > > scenarios, bugs in the plugins, bugs in external systems
> > > > > > > > > >> > > > > > > > > with mutating access on the derived metadata. In the
> > > > > > > > > >> > > > > > > > > worst case, a drift between the cached size and the
> > > > > > > > > >> > > > > > > > > actual size could lead to over-deleting remote data,
> > > > > > > > > >> > > > > > > > > which is a durability risk.
> > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > The alternative you propose, by making the plugin the
> > > > > > > > > >> > > > > > > > > source of truth w.r.t. the size of the remote log, can
> > > > > > > > > >> > > > > > > > > make it easier to avoid inconsistencies between
> > > > > > > > > >> > > > > > > > > plugin-managed metadata and the remote log from the
> > > > > > > > > >> > > > > > > > > perspective of Kafka. On the other hand, plugin vendors
> > > > > > > > > >> > > > > > > > > would have to implement it with the expected efficiency
> > > > > > > > > >> > > > > > > > > to have it yield benefits.
> > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > 102.2) As you mentioned, the caching strategy in Kafka
> > > > > > > > > >> > > > > > > > > would still require one iteration over the list of
> > > > > > > > > >> > > > > > > > > rlmMetadata when the leadership of a topic-partition is
> > > > > > > > > >> > > > > > > > > assigned to a broker, while the plugin can offer
> > > > > > > > > >> > > > > > > > > alternative constant-time approaches. This calculation
> > > > > > > > > >> > > > > > > > > cannot be put on the LeaderAndIsr path and would be
> > > > > > > > > >> > > > > > > > > performed in the background. In case of bulk leadership
> > > > > > > > > >> > > > > > > > > migration, listing the rlmMetadata could a) result in
> > > > > > > > > >> > > > > > > > > request bursts to any backend system the plugin may use
> > > > > > > > > >> > > > > > > > > [which shouldn’t be a problem for high-throughput data
> > > > > > > > > >> > > > > > > > > stores but could have cost implications] b) increase the
> > > > > > > > > >> > > > > > > > > utilisation timespan of the RLM threads for these
> > > > > > > > > >> > > > > > > > > calculations, potentially leading to transient starvation
> > > > > > > > > >> > > > > > > > > of tasks queued for, typically, offloading operations
> > > > > > > > > >> > > > > > > > > c) have a non-marginal CPU footprint on hardware with
> > > > > > > > > >> > > > > > > > > strict resource constraints. All these elements could
> > > > > > > > > >> > > > > > > > > have an impact to some degree depending on the
> > > > > > > > > >> > > > > > > > > operational environment.
> > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > From a design perspective, one question is where we want
> > > > > > > > > >> > > > > > > > > the source of truth w.r.t. remote log size to be during
> > > > > > > > > >> > > > > > > > > the lifetime of a leader. The responsibility of
> > > > > > > > > >> > > > > > > > > maintaining a consistent representation of the remote log
> > > > > > > > > >> > > > > > > > > is shared by Kafka and the plugin. Which system is best
> > > > > > > > > >> > > > > > > > > placed to maintain such a state while providing the
> > > > > > > > > >> > > > > > > > > highest consistency guarantees is something both Kafka
> > > > > > > > > >> > > > > > > > > and plugin designers could help understand better.
> > > > > > > > >> > > > > > > > >
> > > > > > > > >> > > > > > > > > Many thanks,
> > > > > > > > >> > > > > > > > > Alexandre
> > > > > > > > >> > > > > > > > >
> > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > On Thu, Nov 17, 2022 at 19:27, Jun Rao <jun@confluent.io.invalid>
> > > > > > > > > >> > > > > > > > > wrote:
> > > > > > > > >> > > > > > > > > >
> > > > > > > > >> > > > > > > > > > Hi, Divij,
> > > > > > > > >> > > > > > > > > >
> > > > > > > > >> > > > > > > > > > Thanks for the reply.
> > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > Point #1. Is the average remote segment metadata really
> > > > > > > > > >> > > > > > > > > > 1KB? What's listed in the public interface is probably
> > > > > > > > > >> > > > > > > > > > well below 100 bytes.
> > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > Point #2. I guess you are assuming that each broker
> > > > > > > > > >> > > > > > > > > > only caches the remote segment metadata in memory. An
> > > > > > > > > >> > > > > > > > > > alternative approach is to cache them in both memory
> > > > > > > > > >> > > > > > > > > > and local disk. That way, on broker restart, you just
> > > > > > > > > >> > > > > > > > > > need to fetch the new remote segments' metadata using
> > > > > > > > > >> > > > > > > > > > the listRemoteLogSegments(TopicIdPartition
> > > > > > > > > >> > > > > > > > > > topicIdPartition, int leaderEpoch) api. Will that work?
> > > > > > > > >> > > > > > > > > >
> > > > > > > > >> > > > > > > > > > Point #3. Thanks for the explanation
> and it
> > > > sounds
> > > > > > > > good.
> > > > > > > > >> > > > > > > > > >
> > > > > > > > >> > > > > > > > > > Thanks,
> > > > > > > > >> > > > > > > > > >
> > > > > > > > >> > > > > > > > > > Jun
> > > > > > > > >> > > > > > > > > >
> > > > > > > > >> > > > > > > > > > On Thu, Nov 17, 2022 at 7:31 AM Divij
> > > Vaidya <
> > > > > > > > >> > > > > > > divijvaidya13@gmail.com>
> > > > > > > > >> > > > > > > > > > wrote:
> > > > > > > > >> > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > Hi Jun
> > > > > > > > >> > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > There are three points that I would
> like
> > > to
> > > > > > > present
> > > > > > > > >> here:
> > > > > > > > >> > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > 1. We would require a large cache
> size to
> > > > > > > > efficiently
> > > > > > > > >> > cache
> > > > > > > > >> > > > all
> > > > > > > > >> > > > > > > > segment
> > > > > > > > >> > > > > > > > > > > metadata.
> > > > > > > > >> > > > > > > > > > > 2. Linear scan of all metadata at
> broker
> > > > startup
> > > > > > > to
> > > > > > > > >> > > populate
> > > > > > > > >> > > > > the
> > > > > > > > >> > > > > > > > cache
> > > > > > > > >> > > > > > > > > will
> > > > > > > > >> > > > > > > > > > > be slow and will impact the archival
> > > > process.
> > > > > > > > >> > > > > > > > > > > 3. There is no other use case where a
> full
> > > > scan
> > > > > > of
> > > > > > > > >> > segment
> > > > > > > > >> > > > > > metadata
> > > > > > > > >> > > > > > > > is
> > > > > > > > >> > > > > > > > > > > required.
> > > > > > > > >> > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > Let's start by quantifying 1. Here's
> my
> > > > estimate
> > > > > > > for
> > > > > > > > >> the
> > > > > > > > >> > > size
> > > > > > > > >> > > > > of
> > > > > > > > >> > > > > > > the
> > > > > > > > >> > > > > > > > > cache.
> > > > > > > > >> > > > > > > > > > > Average size of segment metadata =
> 1KB.
> > > This
> > > > > > could
> > > > > > > > be
> > > > > > > > >> > more
> > > > > > > > >> > > if
> > > > > > > > >> > > > > we
> > > > > > > > >> > > > > > > have
> > > > > > > > >> > > > > > > > > > > frequent leader failover with a large
> > > > number of
> > > > > > > > leader
> > > > > > > > >> > > epochs
> > > > > > > > >> > > > > > being
> > > > > > > > >> > > > > > > > > stored
> > > > > > > > >> > > > > > > > > > > per segment.
> > > > > > > > >> > > > > > > > > > > Segment size = 100MB. Users will
> prefer to
> > > > > > reduce
> > > > > > > > the
> > > > > > > > >> > > segment
> > > > > > > > >> > > > > > size
> > > > > > > > >> > > > > > > > > from the
> > > > > > > > >> > > > > > > > > > > default value of 1GB to ensure timely
> > > > archival
> > > > > > of
> > > > > > > > data
> > > > > > > > >> > > since
> > > > > > > > >> > > > > data
> > > > > > > > >> > > > > > > > from
> > > > > > > > >> > > > > > > > > > > active segment is not archived.
> > > > > > > > >> > > > > > > > > > > Cache size = num segments * avg.
> segment
> > > > > > metadata
> > > > > > > > >> size =
> > > > > > > > >> > > > > > > > > (100TB/100MB)*1KB
> > > > > > > > >> > > > > > > > > > > = 1GB.
> > > > > > > > >> > > > > > > > > > > While 1GB for cache may not sound
> like a
> > > > large
> > > > > > > > number
> > > > > > > > >> for
> > > > > > > > >> > > > > larger
> > > > > > > > >> > > > > > > > > machines,
> > > > > > > > >> > > > > > > > > > > it does eat into the memory as an
> > > additional
> > > > > > cache
> > > > > > > > and
> > > > > > > > >> > > makes
> > > > > > > > >> > > > > use
> > > > > > > > >> > > > > > > > cases
> > > > > > > > >> > > > > > > > > with
> > > > > > > > >> > > > > > > > > > > large data retention with low
> throughput
> > > > > > expensive
> > > > > > > > >> (where
> > > > > > > > >> > > > such
> > > > > > > > >> > > > > > use
> > > > > > > > >> > > > > > > > case
> > > > > > > > > >> > > > > > > > > > > could otherwise use smaller machines).
> > > > > > > > >> > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > About point#2:
> > > > > > > > >> > > > > > > > > > > Even if we say that all segment
> metadata
> > > > can fit
> > > > > > > > into
> > > > > > > > >> the
> > > > > > > > >> > > > > cache,
> > > > > > > > >> > > > > > we
> > > > > > > > >> > > > > > > > > will
> > > > > > > > >> > > > > > > > > > > need to populate the cache on broker
> > > > startup. It
> > > > > > > > would
> > > > > > > > >> > not
> > > > > > > > >> > > be
> > > > > > > > >> > > > > in
> > > > > > > > >> > > > > > > the
> > > > > > > > > >> > > > > > > > > > > critical path of broker startup and
> hence
> > > > won't
> > > > > > > > >> impact
> > > > > > > > >> > the
> > > > > > > > >> > > > > > startup
> > > > > > > > >> > > > > > > > > time.
> > > > > > > > >> > > > > > > > > > > But it will impact the time when we
> could
> > > > start
> > > > > > > the
> > > > > > > > >> > > archival
> > > > > > > > >> > > > > > > process
> > > > > > > > >> > > > > > > > > since
> > > > > > > > >> > > > > > > > > > > the RLM thread pool will be blocked
> on the
> > > > first
> > > > > > > > call
> > > > > > > > >> to
> > > > > > > > >> > > > > > > > > > > listRemoteLogSegments(). To scan
> metadata
> > > > for
> > > > > > 1MM
> > > > > > > > >> > segments
> > > > > > > > >> > > > > > > (computed
> > > > > > > > >> > > > > > > > > above)
> > > > > > > > >> > > > > > > > > > > and transfer 1GB data over the network
> > > from
> > > > a
> > > > > > RLMM
> > > > > > > > >> such
> > > > > > > > >> > as
> > > > > > > > >> > > a
> > > > > > > > >> > > > > > remote
> > > > > > > > >> > > > > > > > > > > database would be in the order of
> minutes
> > > > > > > (depending
> > > > > > > > >> on
> > > > > > > > >> > how
> > > > > > > > >> > > > > > > efficient
> > > > > > > > >> > > > > > > > > the
> > > > > > > > >> > > > > > > > > > > scan is with the RLMM implementation).
> > > > > > Although, I
> > > > > > > > >> would
> > > > > > > > >> > > > > concede
> > > > > > > > >> > > > > > > that
> > > > > > > > >> > > > > > > > > > > having RLM threads blocked for a few
> > > > minutes is
> > > > > > > > >> perhaps
> > > > > > > > >> > OK
> > > > > > > > >> > > > but
> > > > > > > > >> > > > > if
> > > > > > > > >> > > > > > > we
> > > > > > > > >> > > > > > > > > > > introduce the new API proposed in the
> KIP,
> > > > we
> > > > > > > would
> > > > > > > > >> have
> > > > > > > > >> > a
> > > > > > > > >> > > > > > > > > > > deterministic startup time for RLM.
> Adding
> > > > the
> > > > > > API
> > > > > > > > >> comes
> > > > > > > > >> > > at a
> > > > > > > > >> > > > > low
> > > > > > > > >> > > > > > > > cost
> > > > > > > > >> > > > > > > > > and
> > > > > > > > >> > > > > > > > > > > I believe the trade off is worth it.
> > > > > > > > >> > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > About point#3:
> > > > > > > > >> > > > > > > > > > > We can use
> > > > > > listRemoteLogSegments(TopicIdPartition
> > > > > > > > >> > > > > > topicIdPartition,
> > > > > > > > >> > > > > > > > int
> > > > > > > > >> > > > > > > > > > > leaderEpoch) to calculate the segments
> > > > eligible
> > > > > > > for
> > > > > > > > >> > > deletion
> > > > > > > > >> > > > > > (based
> > > > > > > > >> > > > > > > > on
> > > > > > > > >> > > > > > > > > size
> > > > > > > > >> > > > > > > > > > > retention) where leader epoch(s)
> belong to
> > > > the
> > > > > > > > current
> > > > > > > > >> > > leader
> > > > > > > > >> > > > > > epoch
> > > > > > > > >> > > > > > > > > chain.
> > > > > > > > >> > > > > > > > > > > I understand that it may lead to
> segments
> > > > > > > belonging
> > > > > > > > to
> > > > > > > > >> > > other
> > > > > > > > >> > > > > > epoch
> > > > > > > > >> > > > > > > > > lineage
> > > > > > > > >> > > > > > > > > > > not getting deleted and would require
> a
> > > > separate
> > > > > > > > >> > mechanism
> > > > > > > > >> > > to
> > > > > > > > >> > > > > > > delete
> > > > > > > > >> > > > > > > > > them.
> > > > > > > > >> > > > > > > > > > > The separate mechanism would anyways
> be
> > > > required
> > > > > > > to
> > > > > > > > >> > delete
> > > > > > > > >> > > > > these
> > > > > > > > >> > > > > > > > > "leaked"
> > > > > > > > >> > > > > > > > > > > segments as there are other cases
> which
> > > > could
> > > > > > lead
> > > > > > > > to
> > > > > > > > >> > leaks
> > > > > > > > >> > > > > such
> > > > > > > > >> > > > > > as
> > > > > > > > >> > > > > > > > > network
> > > > > > > > > >> > > > > > > > > > > problems with RSM midway through
> writing a
> > > > > > > segment,
> > > > > > > > > >> etc.
> > > > > > > > >> > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > Thank you for the replies so far. They
> > > have
> > > > made
> > > > > > > me
> > > > > > > > >> > > re-think
> > > > > > > > >> > > > my
> > > > > > > > >> > > > > > > > > assumptions
> > > > > > > > >> > > > > > > > > > > and this dialogue has been very
> > > > constructive for
> > > > > > > me.
> > > > > > > > >> > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > Regards,
> > > > > > > > >> > > > > > > > > > > Divij Vaidya
> > > > > > > > >> > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > On Thu, Nov 10, 2022 at 10:49 PM Jun
> Rao
> > > > > > > > >> > > > > > <jun@confluent.io.invalid
> > > > > > > > >> > > > > > > >
> > > > > > > > >> > > > > > > > > wrote:
> > > > > > > > >> > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > Hi, Divij,
> > > > > > > > >> > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > Thanks for the reply.
> > > > > > > > >> > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > It's true that the data in Kafka
> could
> > > be
> > > > kept
> > > > > > > > >> longer
> > > > > > > > >> > > with
> > > > > > > > >> > > > > > > KIP-405.
> > > > > > > > >> > > > > > > > > How
> > > > > > > > >> > > > > > > > > > > > much data do you envision to have
> per
> > > > broker?
> > > > > > > For
> > > > > > > > >> 100TB
> > > > > > > > >> > > > data
> > > > > > > > >> > > > > > per
> > > > > > > > >> > > > > > > > > broker,
> > > > > > > > >> > > > > > > > > > > > with 1GB segment and segment
> metadata of
> > > > 100
> > > > > > > > bytes,
> > > > > > > > >> it
> > > > > > > > >> > > > > requires
> > > > > > > > >> > > > > > > > > > > > 100TB/1GB*100 = 10MB, which should
> fit
> > > in
> > > > > > > memory.
> > > > > > > > >> > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > RemoteLogMetadataManager has two
> > > > > > > > >> > listRemoteLogSegments()
> > > > > > > > >> > > > > > methods.
> > > > > > > > >> > > > > > > > > The one
> > > > > > > > >> > > > > > > > > > > > you listed
> > > > > > > listRemoteLogSegments(TopicIdPartition
> > > > > > > > >> > > > > > > topicIdPartition,
> > > > > > > > >> > > > > > > > > int
> > > > > > > > >> > > > > > > > > > > > leaderEpoch) does return data in
> offset
> > > > order.
> > > > > > > > >> However,
> > > > > > > > >> > > the
> > > > > > > > >> > > > > > other
> > > > > > > > >> > > > > > > > > > > > one
> > > listRemoteLogSegments(TopicIdPartition
> > > > > > > > >> > > > topicIdPartition)
> > > > > > > > >> > > > > > > > doesn't
> > > > > > > > >> > > > > > > > > > > > specify the return order. I assume
> that
> > > > you
> > > > > > need
> > > > > > > > the
> > > > > > > > >> > > latter
> > > > > > > > >> > > > > to
> > > > > > > > >> > > > > > > > > calculate
> > > > > > > > >> > > > > > > > > > > > the segment size?
> > > > > > > > >> > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > Thanks,
> > > > > > > > >> > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > Jun
> > > > > > > > >> > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > On Thu, Nov 10, 2022 at 10:25 AM
> Divij
> > > > Vaidya
> > > > > > <
> > > > > > > > >> > > > > > > > > divijvaidya13@gmail.com>
> > > > > > > > >> > > > > > > > > > > > wrote:
> > > > > > > > >> > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > *Jun,*
> > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > *"the default implementation of
> RLMM
> > > > does
> > > > > > > local
> > > > > > > > >> > > caching,
> > > > > > > > >> > > > > > > right?"*
> > > > > > > > >> > > > > > > > > > > > > Yes, Jun. The default
> implementation
> > > of
> > > > RLMM
> > > > > > > > does
> > > > > > > > >> > > indeed
> > > > > > > > >> > > > > > cache
> > > > > > > > >> > > > > > > > the
> > > > > > > > >> > > > > > > > > > > > segment
> > > > > > > > >> > > > > > > > > > > > > metadata today, hence, it won't
> work
> > > > for use
> > > > > > > > cases
> > > > > > > > >> > when
> > > > > > > > >> > > > the
> > > > > > > > >> > > > > > > > number
> > > > > > > > >> > > > > > > > > of
> > > > > > > > >> > > > > > > > > > > > > segments in remote storage is
> large
> > > > enough
> > > > > > to
> > > > > > > > >> exceed
> > > > > > > > >> > > the
> > > > > > > > >> > > > > size
> > > > > > > > >> > > > > > > of
> > > > > > > > >> > > > > > > > > cache.
> > > > > > > > >> > > > > > > > > > > > As
> > > > > > > > >> > > > > > > > > > > > > part of this KIP, I will
> implement the
> > > > new
> > > > > > > > >> proposed
> > > > > > > > >> > API
> > > > > > > > >> > > > in
> > > > > > > > >> > > > > > the
> > > > > > > > >> > > > > > > > > default
> > > > > > > > >> > > > > > > > > > > > > implementation of RLMM but the
> > > > underlying
> > > > > > > > >> > > implementation
> > > > > > > > >> > > > > will
> > > > > > > > >> > > > > > > > > still be
> > > > > > > > >> > > > > > > > > > > a
> > > > > > > > >> > > > > > > > > > > > > scan. I will pick up optimizing
> that
> > > in
> > > > a
> > > > > > > > separate
> > > > > > > > >> > PR.
> > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > *"we also cache all segment
> metadata
> > > in
> > > > the
> > > > > > > > >> brokers
> > > > > > > > >> > > > without
> > > > > > > > >> > > > > > > > > KIP-405. Do
> > > > > > > > >> > > > > > > > > > > > you
> > > > > > > > >> > > > > > > > > > > > > see a need to change that?"*
> > > > > > > > >> > > > > > > > > > > > > Please correct me if I am wrong
> here
> > > > but we
> > > > > > > > cache
> > > > > > > > >> > > > metadata
> > > > > > > > >> > > > > > for
> > > > > > > > >> > > > > > > > > segments
> > > > > > > > >> > > > > > > > > > > > > "residing in local storage". The
> size
> > > > of the
> > > > > > > > >> current
> > > > > > > > >> > > > cache
> > > > > > > > >> > > > > > > works
> > > > > > > > >> > > > > > > > > fine
> > > > > > > > >> > > > > > > > > > > for
> > > > > > > > >> > > > > > > > > > > > > the scale of the number of
> segments
> > > > that we
> > > > > > > > >> expect to
> > > > > > > > >> > > > store
> > > > > > > > >> > > > > > in
> > > > > > > > >> > > > > > > > > local
> > > > > > > > >> > > > > > > > > > > > > storage. After KIP-405, that cache
> > > will
> > > > > > > continue
> > > > > > > > >> to
> > > > > > > > >> > > store
> > > > > > > > >> > > > > > > > metadata
> > > > > > > > >> > > > > > > > > for
> > > > > > > > >> > > > > > > > > > > > > segments which are residing in
> local
> > > > storage
> > > > > > > and
> > > > > > > > >> > hence,
> > > > > > > > >> > > > we
> > > > > > > > >> > > > > > > don't
> > > > > > > > >> > > > > > > > > need
> > > > > > > > >> > > > > > > > > > > to
> > > > > > > > >> > > > > > > > > > > > > change that. For segments which
> have
> > > > been
> > > > > > > > >> offloaded
> > > > > > > > >> > to
> > > > > > > > >> > > > > remote
> > > > > > > > >> > > > > > > > > storage,
> > > > > > > > >> > > > > > > > > > > it
> > > > > > > > >> > > > > > > > > > > > > would rely on RLMM. Note that the
> > > scale
> > > > of
> > > > > > > data
> > > > > > > > >> > stored
> > > > > > > > >> > > in
> > > > > > > > >> > > > > > RLMM
> > > > > > > > >> > > > > > > is
> > > > > > > > >> > > > > > > > > > > > different
> > > > > > > > >> > > > > > > > > > > > > from local cache because the
> number of
> > > > > > > segments
> > > > > > > > is
> > > > > > > > >> > > > expected
> > > > > > > > >> > > > > > to
> > > > > > > > >> > > > > > > be
> > > > > > > > >> > > > > > > > > much
> > > > > > > > >> > > > > > > > > > > > > larger than what current
> > > implementation
> > > > > > stores
> > > > > > > > in
> > > > > > > > >> > local
> > > > > > > > >> > > > > > > storage.
> > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > 2,3,4:
> > > > > > > > >> > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > >> > > > > does
> > > > > > > > >> > > > > > > > > specify
> > > > > > > > >> > > > > > > > > > > the
> > > > > > > > >> > > > > > > > > > > > > order i.e. it returns the segments
> > > > sorted by
> > > > > > > > first
> > > > > > > > >> > > offset
> > > > > > > > >> > > > > in
> > > > > > > > >> > > > > > > > > ascending
> > > > > > > > >> > > > > > > > > > > > > order. I am copying the API docs
> for
> > > > KIP-405
> > > > > > > > here
> > > > > > > > >> for
> > > > > > > > >> > > > your
> > > > > > > > >> > > > > > > > > reference
> > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > *Returns iterator of remote log
> > > segment
> > > > > > > > metadata,
> > > > > > > > >> > > sorted
> > > > > > > > >> > > > by
> > > > > > > > >> > > > > > > > {@link
> > > > > > > > >> > > > > > > > > > > > >
> > > RemoteLogSegmentMetadata#startOffset()}
> > > > > > > > > >> in ascending
> > > > > > > > >> > > order
> > > > > > > > >> > > > > > which
> > > > > > > > >> > > > > > > > > > > contains
> > > > > > > > >> > > > > > > > > > > > > the given leader epoch. This is
> used
> > > by
> > > > > > remote
> > > > > > > > log
> > > > > > > > >> > > > > retention
> > > > > > > > >> > > > > > > > > management
> > > > > > > > > >> > > > > > > > > > > > > subsystem to fetch the segment
> metadata
> > > > for a
> > > > > > > > given
> > > > > > > > >> > > leader
> > > > > > > > > >> > > > > > > > > epoch. @param
> > > > > > > > >> > > > > > > > > > > > > topicIdPartition topic
> partition @param
> > > > > > > > >> leaderEpoch
> > > > > > > > >> > > > > > leader
> > > > > > > > > >> > > > > > > > > > > > > epoch @return
> > > > > > > > >> > > > > > > > > > > > > Iterator of remote segments,
> sorted by
> > > > start
> > > > > > > > >> offset
> > > > > > > > >> > in
> > > > > > > > >> > > > > > > ascending
> > > > > > > > >> > > > > > > > > > > order. *
> > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > *Luke,*
> > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > 5. Note that we are trying to
> optimize
> > > > the
> > > > > > > > >> efficiency
> > > > > > > > >> > > of
> > > > > > > > >> > > > > size
> > > > > > > > >> > > > > > > > based
> > > > > > > > >> > > > > > > > > > > > > retention for remote storage.
> KIP-405
> > > > does
> > > > > > not
> > > > > > > > >> > > introduce
> > > > > > > > >> > > > a
> > > > > > > > >> > > > > > new
> > > > > > > > >> > > > > > > > > config
> > > > > > > > >> > > > > > > > > > > for
> > > > > > > > >> > > > > > > > > > > > > periodically checking remote
> similar
> > > to
> > > > > > > > >> > > > > > > > > > > log.retention.check.interval.ms
> > > > > > > > >> > > > > > > > > > > > > which is applicable for remote
> > > storage.
> > > > > > Hence,
> > > > > > > > the
> > > > > > > > >> > > metric
> > > > > > > > >> > > > > > will
> > > > > > > > >> > > > > > > be
> > > > > > > > >> > > > > > > > > > > updated
> > > > > > > > >> > > > > > > > > > > > > at the time of invoking log
> retention
> > > > check
> > > > > > > for
> > > > > > > > >> > remote
> > > > > > > > >> > > > tier
> > > > > > > > >> > > > > > > which
> > > > > > > > >> > > > > > > > > is
> > > > > > > > >> > > > > > > > > > > > > pending implementation today. We
> can
> > > > perhaps
> > > > > > > > come
> > > > > > > > >> > back
> > > > > > > > >> > > > and
> > > > > > > > >> > > > > > > update
> > > > > > > > >> > > > > > > > > the
> > > > > > > > >> > > > > > > > > > > > > metric description after the
> > > > implementation
> > > > > > of
> > > > > > > > log
> > > > > > > > >> > > > > retention
> > > > > > > > >> > > > > > > > check
> > > > > > > > >> > > > > > > > > in
> > > > > > > > >> > > > > > > > > > > > > RemoteLogManager.
> > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > --
> > > > > > > > >> > > > > > > > > > > > > Divij Vaidya
> > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > On Thu, Nov 10, 2022 at 6:16 AM
> Luke
> > > > Chen <
> > > > > > > > >> > > > > showuon@gmail.com
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > > > wrote:
> > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > Hi Divij,
> > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > One more question about the
> metric:
> > > > > > > > >> > > > > > > > > > > > > > I think the metric will be
> updated
> > > > when
> > > > > > > > >> > > > > > > > > > > > > > (1) each time we run the log
> > > retention
> > > > > > check
> > > > > > > > >> (that
> > > > > > > > >> > > is,
> > > > > > > > >> > > > > > > > > > > > > > log.retention.check.interval.ms
> )
> > > > > > > > > >> > > > > > > > > > > > > > (2) when the user explicitly calls
> > > > > > > getRemoteLogSize
> > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > Is that correct?
> > > > > > > > >> > > > > > > > > > > > > > Maybe we should add a note in
> metric
> > > > > > > > >> description,
> > > > > > > > >> > > > > > otherwise,
> > > > > > > > > >> > > > > > > > when
> > > > > > > > > >> > > > > > > > > > > a user
> > > > > > > > > >> > > > > > > > > > > > > gets,
> > > > > > > > > >> > > > > > > > > > > > > > let's say, 0 for
> RemoteLogSizeBytes,
> > > > they will be
> > > > > > > > > >> > surprised.
> > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > Otherwise, LGTM
> > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > Thank you for the KIP
> > > > > > > > >> > > > > > > > > > > > > > Luke
> > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > On Thu, Nov 10, 2022 at 2:55 AM
> Jun
> > > > Rao
> > > > > > > > >> > > > > > > > <jun@confluent.io.invalid
> > > > > > > > >> > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > wrote:
> > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > Hi, Divij,
> > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > Thanks for the explanation.
> > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > 1. Hmm, the default
> implementation
> > > > of
> > > > > > RLMM
> > > > > > > > >> does
> > > > > > > > >> > > local
> > > > > > > > >> > > > > > > > caching,
> > > > > > > > >> > > > > > > > > > > right?
> > > > > > > > >> > > > > > > > > > > > > > > Currently, we also cache all
> > > segment
> > > > > > > > metadata
> > > > > > > > >> in
> > > > > > > > >> > > the
> > > > > > > > >> > > > > > > brokers
> > > > > > > > >> > > > > > > > > > > without
> > > > > > > > >> > > > > > > > > > > > > > > KIP-405. Do you see a need to
> > > change
> > > > > > that?
> > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > 2,3,4: Yes, your explanation
> makes
> > > > > > sense.
> > > > > > > > >> > However,
> > > > > > > > >> > > > > > > > > > > > > > > currently,
> > > > > > > > >> > > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > >> > > > > > > > > doesn't
> > > > > > > > >> > > > > > > > > > > > > > specify
> > > > > > > > >> > > > > > > > > > > > > > > a particular order of the
> > > iterator.
> > > > Do
> > > > > > you
> > > > > > > > >> intend
> > > > > > > > >> > > to
> > > > > > > > >> > > > > > change
> > > > > > > > >> > > > > > > > > that?
> > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > Thanks,
> > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > Jun
> > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 3:31 AM
> > > Divij
> > > > > > > Vaidya
> > > > > > > > <
> > > > > > > > >> > > > > > > > > > > divijvaidya13@gmail.com
> > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > wrote:
> > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > Hey Jun
> > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > Thank you for your comments.
> > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > *1. "RLMM implementor could
> > > ensure
> > > > > > that
> > > > > > > > >> > > > > > > > > listRemoteLogSegments()
> > > > > > > > >> > > > > > > > > > > is
> > > > > > > > >> > > > > > > > > > > > > > fast"*
> > > > > > > > >> > > > > > > > > > > > > > > > This would be ideal but
> > > > pragmatically,
> > > > > > > it
> > > > > > > > is
> > > > > > > > >> > > > > difficult
> > > > > > > > >> > > > > > to
> > > > > > > > >> > > > > > > > > ensure
> > > > > > > > >> > > > > > > > > > > > that
> > > > > > > > >> > > > > > > > > > > > > > > > listRemoteLogSegments() is
> fast.
> > > > This
> > > > > > is
> > > > > > > > >> > because
> > > > > > > > >> > > of
> > > > > > > > >> > > > > the
> > > > > > > > >> > > > > > > > > > > possibility
> > > > > > > > >> > > > > > > > > > > > > of
> > > > > > > > >> > > > > > > > > > > > > > a
> > > > > > > > >> > > > > > > > > > > > > > > > large number of segments
> (much
> > > > larger
> > > > > > > than
> > > > > > > > >> what
> > > > > > > > >> > > > Kafka
> > > > > > > > >> > > > > > > > > currently
> > > > > > > > >> > > > > > > > > > > > > handles
> > > > > > > > >> > > > > > > > > > > > > > > > with local storage today)
> would
> > > > make
> > > > > > it
> > > > > > > > >> > > infeasible
> > > > > > > > >> > > > to
> > > > > > > > >> > > > > > > adopt
> > > > > > > > >> > > > > > > > > > > > > strategies
> > > > > > > > >> > > > > > > > > > > > > > > such
> > > > > > > > >> > > > > > > > > > > > > > > > as local caching to improve
> the
> > > > > > > > performance
> > > > > > > > >> of
> > > > > > > > >> > > > > > > > > > > > listRemoteLogSegments.
> > > > > > > > >> > > > > > > > > > > > > > > Apart
> > > > > > > > >> > > > > > > > > > > > > > > > from caching (which won't
> work
> > > > due to
> > > > > > > size
> > > > > > > > >> > > > > > limitations) I
> > > > > > > > >> > > > > > > > > can't
> > > > > > > > >> > > > > > > > > > > > think
> > > > > > > > >> > > > > > > > > > > > > > of
> > > > > > > > >> > > > > > > > > > > > > > > > other strategies which may
> > > > eliminate
> > > > > > the
> > > > > > > > >> need
> > > > > > > > >> > for
> > > > > > > > >> > > > IO
> > > > > > > > >> > > > > > > > > > > > > > > > operations proportional to
> the
> > > > number
> > > > > > of
> > > > > > > > >> total
> > > > > > > > >> > > > > > segments.
> > > > > > > > >> > > > > > > > > Please
> > > > > > > > >> > > > > > > > > > > > > advise
> > > > > > > > >> > > > > > > > > > > > > > if
> > > > > > > > >> > > > > > > > > > > > > > > > you have something in mind.
> > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > 2.  "*If the size exceeds
> the
> > > > > > retention
> > > > > > > > >> size,
> > > > > > > > >> > we
> > > > > > > > >> > > > need
> > > > > > > > >> > > > > > to
> > > > > > > > >> > > > > > > > > > > determine
> > > > > > > > >> > > > > > > > > > > > > the
> > > > > > > > >> > > > > > > > > > > > > > > > subset of segments to
> delete to
> > > > bring
> > > > > > > the
> > > > > > > > >> size
> > > > > > > > >> > > > within
> > > > > > > > >> > > > > > the
> > > > > > > > >> > > > > > > > > > > retention
> > > > > > > > >> > > > > > > > > > > > > > > limit.
> > > > > > > > >> > > > > > > > > > > > > > > > Do we need to call
> > > > > > > > >> > > > > > > > > > >
> > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > >> > > > > > > > > > > > > to
> > > > > > > > >> > > > > > > > > > > > > > > > determine that?"*
> > > > > > > > >> > > > > > > > > > > > > > > > Yes, we need to call
> > > > > > > > >> listRemoteLogSegments() to
> > > > > > > > >> > > > > > determine
> > > > > > > > >> > > > > > > > > which
> > > > > > > > >> > > > > > > > > > > > > > segments
> > > > > > > > >> > > > > > > > > > > > > > > > should be deleted. But
> there is
> > > a
> > > > > > > > difference
> > > > > > > > >> > with
> > > > > > > > >> > > > the
> > > > > > > > >> > > > > > use
> > > > > > > > >> > > > > > > > > case we
> > > > > > > > >> > > > > > > > > > > > are
> > > > > > > > >> > > > > > > > > > > > > > > > trying to optimize with this
> > > KIP.
> > > > To
> > > > > > > > >> determine
> > > > > > > > >> > > the
> > > > > > > > >> > > > > > subset
> > > > > > > > >> > > > > > > > of
> > > > > > > > >> > > > > > > > > > > > segments
> > > > > > > > >> > > > > > > > > > > > > > > which
> > > > > > > > >> > > > > > > > > > > > > > > > would be deleted, we only
> read
> > > > > > metadata
> > > > > > > > for
> > > > > > > > >> > > > segments
> > > > > > > > >> > > > > > > which
> > > > > > > > >> > > > > > > > > would
> > > > > > > > >> > > > > > > > > > > be
> > > > > > > > >> > > > > > > > > > > > > > > deleted
> > > > > > > > >> > > > > > > > > > > > > > > > via the
> listRemoteLogSegments().
> > > > But
> > > > > > to
> > > > > > > > >> > determine
> > > > > > > > >> > > > the
> > > > > > > > >> > > > > > > > > > > totalLogSize,
> > > > > > > > >> > > > > > > > > > > > > > which
> > > > > > > > >> > > > > > > > > > > > > > > > is required every time
> retention
> > > > logic
> > > > > > > > >> based on
> > > > > > > > >> > > > size
> > > > > > > > >> > > > > > > > > executes, we
> > > > > > > > >> > > > > > > > > > > > > read
> > > > > > > > >> > > > > > > > > > > > > > > > metadata of *all* the
> segments
> > > in
> > > > > > remote
> > > > > > > > >> > storage.
> > > > > > > > >> > > > > > Hence,
> > > > > > > > >> > > > > > > > the
> > > > > > > > >> > > > > > > > > > > number
> > > > > > > > >> > > > > > > > > > > > > of
> > > > > > > > >> > > > > > > > > > > > > > > > results returned by
> > > > > > > > >> > > > > > > > > > > >
> > > > > > > *RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > >> > > > > > > > > > > > > > *is
> > > > > > > > >> > > > > > > > > > > > > > > > different when we are
> > > calculating
> > > > > > > > >> totalLogSize
> > > > > > > > >> > > vs.
> > > > > > > > >> > > > > when
> > > > > > > > >> > > > > > > we
> > > > > > > > >> > > > > > > > > are
> > > > > > > > >> > > > > > > > > > > > > > > determining
> > > > > > > > >> > > > > > > > > > > > > > > > the subset of segments to
> > > delete.
> > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > 3.
> > > > > > > > >> > > > > > > > > > > > > > > > *"Also, what about
> time-based
> > > > > > retention?
> > > > > > > > To
> > > > > > > > >> > make
> > > > > > > > >> > > > that
> > > > > > > > >> > > > > > > > > efficient,
> > > > > > > > >> > > > > > > > > > > do
> > > > > > > > >> > > > > > > > > > > > > we
> > > > > > > > >> > > > > > > > > > > > > > > need
> > > > > > > > >> > > > > > > > > > > > > > > > to make some additional
> > > interface
> > > > > > > > > >> changes?"* No.
> > > > > > > > >> > > > Note
> > > > > > > > >> > > > > > that
> > > > > > > > >> > > > > > > > > time
> > > > > > > > >> > > > > > > > > > > > > > complexity
> > > > > > > > >> > > > > > > > > > > > > > > > to determine the segments
> for
> > > > > > retention
> > > > > > > is
> > > > > > > > >> > > > different
> > > > > > > > >> > > > > > for
> > > > > > > > >> > > > > > > > time
> > > > > > > > >> > > > > > > > > > > based
> > > > > > > > >> > > > > > > > > > > > > vs.
> > > > > > > > >> > > > > > > > > > > > > > > > size based. For time based,
> the
> > > > time
> > > > > > > > >> complexity
> > > > > > > > >> > > is
> > > > > > > > >> > > > a
> > > > > > > > >> > > > > > > > > function of
> > > > > > > > >> > > > > > > > > > > > the
> > > > > > > > >> > > > > > > > > > > > > > > number
> > > > > > > > >> > > > > > > > > > > > > > > > of segments which are
> "eligible
> > > > for
> > > > > > > > >> deletion"
> > > > > > > > >> > > > (since
> > > > > > > > >> > > > > we
> > > > > > > > >> > > > > > > > only
> > > > > > > > >> > > > > > > > > read
> > > > > > > > >> > > > > > > > > > > > > > > metadata
> > > > > > > > >> > > > > > > > > > > > > > > > for segments which would be
> > > > deleted)
> > > > > > > > >> whereas in
> > > > > > > > >> > > > size
> > > > > > > > >> > > > > > > based
> > > > > > > > >> > > > > > > > > > > > retention,
> > > > > > > > >> > > > > > > > > > > > > > the
> > > > > > > > >> > > > > > > > > > > > > > > > time complexity is a
> function of
> > > > "all
> > > > > > > > >> segments"
> > > > > > > > >> > > > > > available
> > > > > > > > >> > > > > > > > in
> > > > > > > > >> > > > > > > > > > > remote
> > > > > > > > >> > > > > > > > > > > > > > > storage
> > > > > > > > >> > > > > > > > > > > > > > > > (metadata of all segments
> needs
> > > > to be
> > > > > > > read
> > > > > > > > >> to
> > > > > > > > >> > > > > calculate
> > > > > > > > >> > > > > > > the
> > > > > > > > >> > > > > > > > > total
> > > > > > > > >> > > > > > > > > > > > > > size).
> > > > > > > > >> > > > > > > > > > > > > > > As
> > > > > > > > >> > > > > > > > > > > > > > > > you may observe, this KIP
> will
> > > > bring
> > > > > > the
> > > > > > > > >> time
> > > > > > > > >> > > > > > complexity
> > > > > > > > >> > > > > > > > for
> > > > > > > > >> > > > > > > > > both
> > > > > > > > >> > > > > > > > > > > > > time
> > > > > > > > >> > > > > > > > > > > > > > > > based retention & size based
> > > > retention
> > > > > > > to
> > > > > > > > >> the
> > > > > > > > >> > > same
> > > > > > > > >> > > > > > > > function.
> > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > 4. Also, please note that
> this
> > > > new API
> > > > > > > > >> > introduced
> > > > > > > > >> > > > in
> > > > > > > > >> > > > > > this
> > > > > > > > >> > > > > > > > KIP
> > > > > > > > >> > > > > > > > > > > also
> > > > > > > > >> > > > > > > > > > > > > > > enables
> > > > > > > > >> > > > > > > > > > > > > > > > us to provide a metric for
> total
> > > > size
> > > > > > of
> > > > > > > > >> data
> > > > > > > > >> > > > stored
> > > > > > > > >> > > > > in
> > > > > > > > >> > > > > > > > > remote
> > > > > > > > >> > > > > > > > > > > > > storage.
> > > > > > > > >> > > > > > > > > > > > > > > > Without the API,
> calculation of
> > > > this
> > > > > > > > metric
> > > > > > > > >> > will
> > > > > > > > >> > > > > become
> > > > > > > > >> > > > > > > > very
> > > > > > > > >> > > > > > > > > > > > > expensive
> > > > > > > > >> > > > > > > > > > > > > > > with
> > > > > > > > >> > > > > > > > > > > > > > > > *listRemoteLogSegments().*
> > > > > > > > >> > > > > > > > > > > > > > > > I understand that your
> > > motivation
> > > > here
> > > > > > > is
> > > > > > > > to
> > > > > > > > >> > > avoid
> > > > > > > > >> > > > > > > > polluting
> > > > > > > > >> > > > > > > > > the
> > > > > > > > >> > > > > > > > > > > > > > > interface
> > > > > > > > >> > > > > > > > > > > > > > > > with optimization specific
> APIs
> > > > and I
> > > > > > > will
> > > > > > > > >> > agree
> > > > > > > > >> > > > with
> > > > > > > > >> > > > > > > that
> > > > > > > > >> > > > > > > > > goal.
> > > > > > > > >> > > > > > > > > > > > But
> > > > > > > > >> > > > > > > > > > > > > I
> > > > > > > > >> > > > > > > > > > > > > > > > believe that this new API
> > > > proposed in
> > > > > > > the
> > > > > > > > >> KIP
> > > > > > > > >> > > > brings
> > > > > > > > >> > > > > in
> > > > > > > > >> > > > > > > > > > > significant
> > > > > > > > >> > > > > > > > > > > > > > > > improvement and there is no
> > > other
> > > > work
> > > > > > > > >> around
> > > > > > > > >> > > > > available
> > > > > > > > >> > > > > > > to
> > > > > > > > >> > > > > > > > > > > achieve
> > > > > > > > >> > > > > > > > > > > > > the
> > > > > > > > >> > > > > > > > > > > > > > > same
> > > > > > > > >> > > > > > > > > > > > > > > > performance.
> > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > Regards,
> > > > > > > > >> > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at
> 12:12 AM
> > > > Jun
> > > > > > Rao
> > > > > > > > >> > > > > > > > > <jun@confluent.io.invalid
> > > > > > > > >> > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > wrote:
> > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > Hi, Divij,
> > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > Thanks for the KIP. Sorry
> for
> > > > the
> > > > > > late
> > > > > > > > >> reply.
> > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > The motivation of the KIP
> is
> > > to
> > > > > > > improve
> > > > > > > > >> the
> > > > > > > > >> > > > > > efficiency
> > > > > > > > >> > > > > > > of
> > > > > > > > >> > > > > > > > > size
> > > > > > > > >> > > > > > > > > > > > > based
> > > > > > > > >> > > > > > > > > > > > > > > > > retention. I am not sure
> the
> > > > > > proposed
> > > > > > > > >> changes
> > > > > > > > >> > > are
> > > > > > > > >> > > > > > > enough.
> > > > > > > > >> > > > > > > > > For
> > > > > > > > >> > > > > > > > > > > > > > example,
> > > > > > > > >> > > > > > > > > > > > > > > if
> > > > > > > > >> > > > > > > > > > > > > > > > > the size exceeds the
> retention
> > > > size,
> > > > > > > we
> > > > > > > > >> need
> > > > > > > > >> > to
> > > > > > > > >> > > > > > > determine
> > > > > > > > >> > > > > > > > > the
> > > > > > > > >> > > > > > > > > > > > > subset
> > > > > > > > >> > > > > > > > > > > > > > of
> > > > > > > > >> > > > > > > > > > > > > > > > > segments to delete to
> bring
> > > the
> > > > size
> > > > > > > > >> within
> > > > > > > > >> > the
> > > > > > > > >> > > > > > > retention
> > > > > > > > >> > > > > > > > > > > limit.
> > > > > > > > >> > > > > > > > > > > > Do
> > > > > > > > >> > > > > > > > > > > > > > we
> > > > > > > > >> > > > > > > > > > > > > > > > need
> > > > > > > > >> > > > > > > > > > > > > > > > > to call
> > > > > > > > >> > > > > > >
> RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > >> > > > > > > > to
> > > > > > > > >> > > > > > > > > > > > > determine
> > > > > > > > >> > > > > > > > > > > > > > > > that?
> > > > > > > > >> > > > > > > > > > > > > > > > > Also, what about
> time-based
> > > > > > retention?
> > > > > > > > To
> > > > > > > > >> > make
> > > > > > > > >> > > > that
> > > > > > > > >> > > > > > > > > efficient,
> > > > > > > > >> > > > > > > > > > > do
> > > > > > > > >> > > > > > > > > > > > > we
> > > > > > > > >> > > > > > > > > > > > > > > need
> > > > > > > > >> > > > > > > > > > > > > > > > > to make some additional
> > > > interface
> > > > > > > > changes?
> > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > An alternative approach
> is for
> > > > the
> > > > > > > RLMM
> > > > > > > > >> > > > implementor
> > > > > > > > >> > > > > > to
> > > > > > > > >> > > > > > > > make
> > > > > > > > >> > > > > > > > > > > sure
> > > > > > > > >> > > > > > > > > > > > > > > > > that
> > > > > > > > >> > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > >> > > > > > > is
> > > > > > > > >> > > > > > > > > fast
> > > > > > > > >> > > > > > > > > > > > > (e.g.,
> > > > > > > > >> > > > > > > > > > > > > > > with
> > > > > > > > >> > > > > > > > > > > > > > > > > local caching). This way,
> we
> > > > could
> > > > > > > keep
> > > > > > > > >> the
> > > > > > > > >> > > > > interface
> > > > > > > > >> > > > > > > > > simple.
> > > > > > > > >> > > > > > > > > > > > Have
> > > > > > > > >> > > > > > > > > > > > > we
> > > > > > > > >> > > > > > > > > > > > > > > > > considered that?
> > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > Jun
> > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > On Wed, Sep 28, 2022 at
> 6:28
> > > AM
> > > > > > Divij
> > > > > > > > >> Vaidya
> > > > > > > > >> > <
> > > > > > > > >> > > > > > > > > > > > > > divijvaidya13@gmail.com>
> > > > > > > > >> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > Hey folks
> > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > Does anyone else have
> any
> > > > thoughts
> > > > > > > on
> > > > > > > > >> this
> > > > > > > > >> > > > > before I
> > > > > > > > >> > > > > > > > > propose
> > > > > > > > >> > > > > > > > > > > > this
> > > > > > > > >> > > > > > > > > > > > > > for
> > > > > > > > >> > > > > > > > > > > > > > > a
> > > > > > > > >> > > > > > > > > > > > > > > > > > vote?
> > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > --
> > > > > > > > >> > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > On Mon, Sep 5, 2022 at
> 12:57
> > > > PM
> > > > > > > Satish
> > > > > > > > >> > > Duggana
> > > > > > > > >> > > > <
> > > > > > > > >> > > > > > > > > > > > > > > > satish.duggana@gmail.com
> > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > Thanks for the KIP
> Divij!
> > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > This is a nice
> improvement
> > > > to
> > > > > > > avoid
> > > > > > > > >> > > > > recalculation
> > > > > > > > >> > > > > > > of
> > > > > > > > >> > > > > > > > > size.
> > > > > > > > >> > > > > > > > > > > > > > > Customized
> > > > > > > > >> > > > > > > > > > > > > > > > > > RLMMs
> > > > > > > > >> > > > > > > > > > > > > > > > > > > can implement the best
> > > > possible
> > > > > > > > >> approach
> > > > > > > > >> > by
> > > > > > > > >> > > > > > caching
> > > > > > > > >> > > > > > > > or
> > > > > > > > >> > > > > > > > > > > > > > maintaining
> > > > > > > > >> > > > > > > > > > > > > > > > the
> > > > > > > > >> > > > > > > > > > > > > > > > > > size
> > > > > > > > >> > > > > > > > > > > > > > > > > > > in an efficient way.
> But
> > > > this is
> > > > > > > > not a
> > > > > > > > >> > big
> > > > > > > > >> > > > > > concern
> > > > > > > > >> > > > > > > > for
> > > > > > > > >> > > > > > > > > the
> > > > > > > > >> > > > > > > > > > > > > > default
> > > > > > > > >> > > > > > > > > > > > > > > > > topic
> > > > > > > > >> > > > > > > > > > > > > > > > > > > based RLMM as
> mentioned in
> > > > the
> > > > > > > KIP.
> > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > ~Satish.
> > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > On Wed, 13 Jul 2022 at
> > > > 18:48,
> > > > > > > Divij
> > > > > > > > >> > Vaidya
> > > > > > > > >> > > <
> > > > > > > > >> > > > > > > > > > > > > > > divijvaidya13@gmail.com>
> > > > > > > > >> > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > Thank you for your
> > > review
> > > > > > Luke.
> > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Reg: is that
> would the
> > > > new
> > > > > > > > >> > > > > > `RemoteLogSizeBytes`
> > > > > > > > >> > > > > > > > > metric
> > > > > > > > >> > > > > > > > > > > > be a
> > > > > > > > >> > > > > > > > > > > > > > > > > > performance
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > overhead? Although
> we
> > > > move the
> > > > > > > > >> > > calculation
> > > > > > > > >> > > > > to a
> > > > > > > > > >> > > > > > > > > separate
> > > > > > > > >> > > > > > > > > > > > API,
> > > > > > > > >> > > > > > > > > > > > > > we
> > > > > > > > >> > > > > > > > > > > > > > > > > still
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > can't assume users
> will
> > > > > > > implement
> > > > > > > > a
> > > > > > > > >> > > > > > light-weight
> > > > > > > > >> > > > > > > > > method,
> > > > > > > > >> > > > > > > > > > > > > right?
> > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > This metric would be
> > > > logged
> > > > > > > using
> > > > > > > > >> the
> > > > > > > > >> > > > > > information
> > > > > > > > >> > > > > > > > > that is
> > > > > > > > >> > > > > > > > > > > > > > already
> > > > > > > > >> > > > > > > > > > > > > > > > > being
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > calculated for
> handling
> > > > remote
> > > > > > > > >> > retention
> > > > > > > > >> > > > > logic,
> > > > > > > > >> > > > > > > > > hence, no
> > > > > > > > >> > > > > > > > > > > > > > > > additional
> > > > > > > > >> > > > > > > > > > > > > > > > > > work
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > is required to
> calculate
> > > > this
> > > > > > > > >> metric.
> > > > > > > > >> > > More
> > > > > > > > >> > > > > > > > > specifically,
> > > > > > > > >> > > > > > > > > > > > > > whenever
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > RemoteLogManager
> calls
> > > > > > > > >> getRemoteLogSize
> > > > > > > > >> > > > API,
> > > > > > > > >> > > > > > this
> > > > > > > > >> > > > > > > > > metric
> > > > > > > > >> > > > > > > > > > > > > would
> > > > > > > > >> > > > > > > > > > > > > > be
> > > > > > > > >> > > > > > > > > > > > > > > > > > > captured.
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > This API call is
> made
> > > > every
> > > > > > time
> > > > > > > > >> > > > > > RemoteLogManager
> > > > > > > > >> > > > > > > > > wants
> > > > > > > > >> > > > > > > > > > > to
> > > > > > > > >> > > > > > > > > > > > > > handle
> > > > > > > > >> > > > > > > > > > > > > > > > > > expired
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > remote log segments
> > > (which
> > > > > > > should
> > > > > > > > be
> > > > > > > > >> > > > > periodic).
> > > > > > > > >> > > > > > > > Does
> > > > > > > > >> > > > > > > > > that
> > > > > > > > >> > > > > > > > > > > > > > address
> > > > > > > > >> > > > > > > > > > > > > > > > > your
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > concern?
> > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > On Tue, Jul 12,
> 2022 at
> > > > 11:01
> > > > > > AM
> > > > > > > > >> Luke
> > > > > > > > >> > > Chen
> > > > > > > > >> > > > <
> > > > > > > > >> > > > > > > > > > > > > showuon@gmail.com>
> > > > > > > > >> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Hi Divij,
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Thanks for the
> KIP!
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > I think it makes
> sense
> > > > to
> > > > > > > > delegate
> > > > > > > > >> > the
> > > > > > > > >> > > > > > > > > responsibility
> > > > > > > > >> > > > > > > > > > > of
> > > > > > > > >> > > > > > > > > > > > > > > > > calculation
> > > > > > > > >> > > > > > > > > > > > > > > > > > to
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > the
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > specific
> > > > > > > > RemoteLogMetadataManager
> > > > > > > > >> > > > > > > implementation.
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > But one thing I'm
> not
> > > > quite
> > > > > > > > sure,
> > > > > > > > >> is
> > > > > > > > >> > > that
> > > > > > > > >> > > > > > would
> > > > > > > > >> > > > > > > > > the new
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> `RemoteLogSizeBytes`
> > > > metric
> > > > > > > be a
> > > > > > > > >> > > > > performance
> > > > > > > > >> > > > > > > > > overhead?
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Although we move
> the
> > > > > > > calculation
> > > > > > > > >> to a
> > > > > > > > > >> > > > > > separate
> > > > > > > > >> > > > > > > > > API, we
> > > > > > > > >> > > > > > > > > > > > > still
> > > > > > > > >> > > > > > > > > > > > > > > > can't
> > > > > > > > >> > > > > > > > > > > > > > > > > > > assume
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > users will
> implement a
> > > > > > > > >> light-weight
> > > > > > > > >> > > > method,
> > > > > > > > >> > > > > > > > right?
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Thank you.
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Luke
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > On Fri, Jul 1,
> 2022 at
> > > > 5:47
> > > > > > PM
> > > > > > > > >> Divij
> > > > > > > > >> > > > > Vaidya <
> > > > > > > > >> > > > > > > > > > > > > > > > > divijvaidya13@gmail.com
> > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > >
> > > > > > > > >> > > > > > > > >
> > > > > > > > >> > > > > > > >
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > >
> > > > > > > > >> > > >
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-852%3A+Optimize+calculation+of+size+for+log+in+remote+tier
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Hey folks
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Please take a
> look
> > > at
> > > > this
> > > > > > > KIP
> > > > > > > > >> > which
> > > > > > > > >> > > > > > proposes
> > > > > > > > >> > > > > > > > an
> > > > > > > > >> > > > > > > > > > > > > extension
> > > > > > > > >> > > > > > > > > > > > > > to
> > > > > > > > >> > > > > > > > > > > > > > > > > > > KIP-405.
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > This
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > is my first KIP
> with
> > > > > > Apache
> > > > > > > > >> Kafka
> > > > > > > > >> > > > > community
> > > > > > > > >> > > > > > > so
> > > > > > > > >> > > > > > > > > any
> > > > > > > > >> > > > > > > > > > > > > feedback
> > > > > > > > >> > > > > > > > > > > > > > > > would
> > > > > > > > >> > > > > > > > > > > > > > > > > > be
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > highly
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > appreciated.
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Cheers!
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Sr. Software
> > > Engineer
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Amazon
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > >
> > > > > > > > >> > > > > > > > >
> > > > > > > > >> > > > > > > >
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > >
> > > > > > > > >> > > >
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
>

Re: [DISCUSS] KIP-852 Optimize calculation of size for log in remote tier

Posted by Satish Duggana <sa...@gmail.com>.
Thanks Divij for taking the feedback and updating the motivation
section in the KIP.

One more comment on rejected alternative 3: the con listed there is not
valid, since it will not affect broker restart times, as discussed in an
earlier email in this thread. You may want to update that.

~Satish.
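
For readers following the thread, here is a rough sketch of the trade-off
under discussion. It is not taken from the KIP or from Kafka's code:
SegmentMetadata, MetadataStore and InMemoryStore are hypothetical names made
up for illustration, and the real RemoteLogMetadataManager interface differs.
It contrasts computing the remote log size by listing every segment (as with
listRemoteLogSegments()) with asking the metadata store for a total it
already maintains (the role of the new size API).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class RemoteLogSizeSketch {

    // Hypothetical stand-in for one remote segment's metadata.
    static final class SegmentMetadata {
        final long baseOffset;
        final long sizeInBytes;

        SegmentMetadata(long baseOffset, long sizeInBytes) {
            this.baseOffset = baseOffset;
            this.sizeInBytes = sizeInBytes;
        }
    }

    // Hypothetical, trimmed-down metadata store; NOT the real RLMM interface.
    interface MetadataStore {
        Iterator<SegmentMetadata> listSegments(); // plays the role of listRemoteLogSegments()
        long remoteLogSize();                     // plays the role of the proposed size API
    }

    // Size via full scan: reads every segment's metadata, so the cost grows
    // with the number of segments in remote storage.
    static long sizeByFullScan(MetadataStore store) {
        long total = 0L;
        Iterator<SegmentMetadata> it = store.listSegments();
        while (it.hasNext()) {
            total += it.next().sizeInBytes;
        }
        return total;
    }

    // One possible implementation: keep a running total that is updated on
    // every add/remove, so the size query touches no per-segment metadata.
    static final class InMemoryStore implements MetadataStore {
        private final List<SegmentMetadata> segments = new ArrayList<>();
        private long totalBytes = 0L;

        void add(SegmentMetadata segment) {
            segments.add(segment);
            totalBytes += segment.sizeInBytes;
        }

        void remove(SegmentMetadata segment) {
            if (segments.remove(segment)) {
                totalBytes -= segment.sizeInBytes;
            }
        }

        @Override
        public Iterator<SegmentMetadata> listSegments() {
            return segments.iterator();
        }

        @Override
        public long remoteLogSize() {
            return totalBytes;
        }
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        SegmentMetadata first = new SegmentMetadata(0L, 100L);
        store.add(first);
        store.add(new SegmentMetadata(50L, 250L));

        // Both paths agree on the total; only the work done differs.
        System.out.println(sizeByFullScan(store));
        System.out.println(store.remoteLogSize());

        store.remove(first);
        System.out.println(store.remoteLogSize());
    }
}
```

With only the listing call available, a caller is forced onto the full-scan
path even when the store could answer from a maintained total or an index,
which is the point Divij makes further down in the quoted thread.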

On Sun, 2 Jul 2023 at 01:03, Divij Vaidya <di...@gmail.com> wrote:
>
> Thank you folks for reviewing this KIP.
>
> Satish, I have modified the motivation to make it more clear. Now it says,
> "Since the main feature of tiered storage is storing a large amount of
> data, we expect num_remote_segments to be large. A frequent linear scan
> (i.e. listing all segment metadata) could be expensive/slower because of
> the underlying storage used by RemoteLogMetadataManager. This slowness to
> list all segment metadata could result in the loss of availability...."
>
> Jun, Kamal, Satish, if you don't have any further concerns, I would
> appreciate a vote for this KIP in the voting thread -
> https://lists.apache.org/thread/soz00990gvzodv7oyqj4ysvktrqy6xfk
>
> --
> Divij Vaidya
>
>
>
> On Sat, Jul 1, 2023 at 6:16 AM Kamal Chandraprakash <
> kamal.chandraprakash@gmail.com> wrote:
>
> > Hi Divij,
> >
> > Thanks for the explanation. LGTM.
> >
> > --
> > Kamal
> >
> > On Sat, Jul 1, 2023 at 7:28 AM Satish Duggana <sa...@gmail.com>
> > wrote:
> >
> > > Hi Divij,
> > > I am fine with having an API to compute the size as I mentioned in my
> > > earlier reply in this mail thread. But I have the below comment for
> > > the motivation for this KIP.
> > >
> > > As you discussed offline, the main issue here is that listing calls for
> > > remote log segment metadata are slower because of the storage used for
> > > RLMM. These can be avoided with this new API.
> > >
> > > Please add this in the motivation section as it is one of the main
> > > motivations for the KIP.
> > >
> > > Thanks,
> > > Satish.
> > >
> > > On Sat, 1 Jul 2023 at 01:43, Jun Rao <ju...@confluent.io.invalid> wrote:
> > > >
> > > > Hi, Divij,
> > > >
> > > > Sorry for the late reply.
> > > >
> > > > Given your explanation, the new API sounds reasonable to me. Is that
> > > enough
> > > > to build the external metadata layer for the remote segments or do you
> > > need
> > > > some additional API changes?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Fri, Jun 9, 2023 at 7:08 AM Divij Vaidya <di...@gmail.com>
> > > wrote:
> > > >
> > > > > Thank you for looking into this Kamal.
> > > > >
> > > > > You are right in saying that a cold start (i.e. leadership failover
> > or
> > > > > broker startup) does not impact the broker startup duration. But it
> > > does
> > > > > have the following impact:
> > > > > 1. It leads to a burst of full-scan requests to RLMM in case multiple
> > > > > leadership failovers occur at the same time. Even if the RLMM
> > > > > implementation has the capability to serve the total size from an
> > index
> > > > > (and hence handle this burst), we wouldn't be able to use it since
> > the
> > > > > current API necessarily calls for a full scan.
> > > > > 2. The archival (copying of data to tiered storage) process will
> > have a
> > > > > delayed start. The delayed start of archival could lead to local
> > build
> > > up
> > > > > of data which may lead to disk full.
> > > > >
> > > > > The disadvantage of adding this new API is that every provider will
> > > have to
> > > > > implement it, agreed. But I believe that this tradeoff is worthwhile
> > > since
> > > > > the default implementation could be the same as you mentioned, i.e.
> > > keeping
> > > > > cumulative in-memory count.
> > > > >
> > > > > --
> > > > > Divij Vaidya
> > > > >
> > > > >
> > > > >
> > > > > On Sun, Jun 4, 2023 at 5:48 PM Kamal Chandraprakash <
> > > > > kamal.chandraprakash@gmail.com> wrote:
> > > > >
> > > > > > Hi Divij,
> > > > > >
> > > > > > Thanks for the KIP! Sorry for the late reply.
> > > > > >
> > > > > > Can you explain the rejected alternative-3?
> > > > > > Store the cumulative size of remote tier log in-memory at
> > > > > RemoteLogManager
> > > > > > "*Cons*: Every time a broker starts-up, it will scan through all
> > the
> > > > > > segments in the remote tier to initialise the in-memory value. This
> > > would
> > > > > > increase the broker start-up time."
> > > > > >
> > > > > > Keeping the source of truth to determine the remote-log-size in the
> > > > > leader
> > > > > > would be consistent across different implementations of the plugin.
> > > The
> > > > > > concern posted in the KIP is that we are calculating the
> > > remote-log-size
> > > > > on
> > > > > > each iteration of the cleaner thread (say 5 mins). If we calculate
> > > only
> > > > > > once during broker startup or during the leadership reassignment,
> > do
> > > we
> > > > > > still need the cache?
> > > > > >
> > > > > > The broker startup-time won't be affected by the remote log manager
> > > > > > > initialisation. The broker continues to accept new
> > > > > > produce/fetch requests, while the RLM thread in the background can
> > > > > > determine the remote-log-size once and start copying/deleting the
> > > > > segments.
> > > > > >
> > > > > > Thanks,
> > > > > > Kamal
> > > > > >
> > > > > > On Thu, Jun 1, 2023 at 2:08 PM Divij Vaidya <
> > divijvaidya13@gmail.com
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Satish / Jun
> > > > > > >
> > > > > > > Do you have any thoughts on this?
> > > > > > >
> > > > > > > --
> > > > > > > Divij Vaidya
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Feb 14, 2023 at 4:15 PM Divij Vaidya <
> > > divijvaidya13@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hey Jun
> > > > > > > >
> > > > > > > > It has been a while since this KIP got some attention. While we
> > > wait
> > > > > > for
> > > > > > > > Satish to chime in here, perhaps I can answer your question.
> > > > > > > >
> > > > > > > > > Could you explain how you exposed the log size in your
> > KIP-405
> > > > > > > > implementation?
> > > > > > > >
> > > > > > > > The APIs available in RLMM as per KIP405
> > > > > > > > are, addRemoteLogSegmentMetadata(),
> > > updateRemoteLogSegmentMetadata(),
> > > > > > > remoteLogSegmentMetadata(), highestOffsetForEpoch(),
> > > > > > > putRemotePartitionDeleteMetadata(), listRemoteLogSegments(),
> > > > > > > onPartitionLeadershipChanges()
> > > > > > > > and onStopPartitions(). None of these APIs allow us to expose
> > > the log
> > > > > > > size,
> > > > > > > > hence, the only option that remains is to list all segments
> > using
> > > > > > > > listRemoteLogSegments() and aggregate them every time we
> > require
> > > to
> > > > > > > > calculate the size. Based on our prior discussion, this
> > requires
> > > > > > reading
> > > > > > > > all segment metadata which won't work for non-local RLMM
> > > > > > implementations.
> > > > > > > > Satish's implementation also performs a full scan and
> > calculates
> > > the
> > > > > > > > aggregate. see:
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> > https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/remote/RemoteLogManager.scala#L619
> > > > > > > >
> > > > > > > >
> > > > > > > > Does this answer your question?
> > > > > > > >
> > > > > > > > --
> > > > > > > > Divij Vaidya
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Dec 20, 2022 at 8:40 PM Jun Rao
> > <jun@confluent.io.invalid
> > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > >> Hi, Divij,
> > > > > > > >>
> > > > > > > >> Thanks for the explanation.
> > > > > > > >>
> > > > > > > >> Good question.
> > > > > > > >>
> > > > > > > >> Hi, Satish,
> > > > > > > >>
> > > > > > > >> Could you explain how you exposed the log size in your KIP-405
> > > > > > > >> implementation?
> > > > > > > >>
> > > > > > > >> Thanks,
> > > > > > > >>
> > > > > > > >> Jun
> > > > > > > >>
> > > > > > > >> On Tue, Dec 20, 2022 at 4:59 AM Divij Vaidya <
> > > > > divijvaidya13@gmail.com
> > > > > > >
> > > > > > > >> wrote:
> > > > > > > >>
> > > > > > > >> > Hey Jun
> > > > > > > >> >
> > > > > > > >> > Yes, it is possible to maintain the log size in the cache
> > (see
> > > > > > > rejected
> > > > > > > >> > alternative#3 in the KIP) but I did not understand how it is
> > > > > > possible
> > > > > > > to
> > > > > > > >> > retrieve it without the new API. The log size could be
> > > calculated
> > > > > on
> > > > > > > >> > startup by scanning through the segments (though I would
> > > disagree
> > > > > > that
> > > > > > > >> this
> > > > > > > >> > is the right approach since scanning itself takes order of
> > > minutes
> > > > > > and
> > > > > > > > >> > hence delays the start of the archive process), and incrementally
> > > > > > > maintained
> > > > > > > >> > afterwards, even then, we would need an API in
> > > > > > > RemoteLogMetadataManager
> > > > > > > >> so
> > > > > > > >> > that RLM could fetch the cached size!
> > > > > > > >> >
> > > > > > > >> > If we wish to cache the size without adding a new API, then
> > we
> > > > > need
> > > > > > to
> > > > > > > >> > cache the size in RLM itself (instead of RLMM
> > implementation)
> > > and
> > > > > > > >> > incrementally manage it. The downside of longer archive time
> > > at
> > > > > > > startup
> > > > > > > > >> > (due to the initial scan) still remains valid in this
> > situation.
> > > > > > > >> >
> > > > > > > >> > --
> > > > > > > >> > Divij Vaidya
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> > On Fri, Dec 16, 2022 at 12:43 AM Jun Rao
> > > <jun@confluent.io.invalid
> > > > > >
> > > > > > > >> wrote:
> > > > > > > >> >
> > > > > > > >> > > Hi, Divij,
> > > > > > > >> > >
> > > > > > > >> > > Thanks for the explanation.
> > > > > > > >> > >
> > > > > > > >> > > If there is in-memory cache, could we maintain the log
> > size
> > > in
> > > > > the
> > > > > > > >> cache
> > > > > > > >> > > with the existing API? For example, a replica could make a
> > > > > > > >> > > listRemoteLogSegments(TopicIdPartition topicIdPartition)
> > > call on
> > > > > > > >> startup
> > > > > > > >> > to
> > > > > > > >> > > get the remote segment size before the current
> > leaderEpoch.
> > > The
> > > > > > > leader
> > > > > > > >> > > could then maintain the size incrementally afterwards. On
> > > leader
> > > > > > > >> change,
> > > > > > > >> > > other replicas can make a
> > > listRemoteLogSegments(TopicIdPartition
> > > > > > > >> > > topicIdPartition, int leaderEpoch) call to get the size of
> > > newly
> > > > > > > >> > generated
> > > > > > > >> > > segments.
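
A minimal sketch of the incremental approach described above, assuming a hypothetical in-memory stand-in for the RLMM. The names InMemoryRlmm, RemoteSizeTracker, list_remote_log_segments and refresh_epochs are illustrative only and are not the Kafka API; segments are reduced to (leader epoch, size) pairs.

```python
from collections import defaultdict


class InMemoryRlmm:
    """Hypothetical stand-in for a RemoteLogMetadataManager (not the Kafka API)."""

    def __init__(self):
        self._segments = []  # list of (leader_epoch, size_in_bytes)

    def add_segment(self, leader_epoch, size_in_bytes):
        self._segments.append((leader_epoch, size_in_bytes))

    def list_remote_log_segments(self, leader_epoch=None):
        # With no epoch: full listing. With an epoch: only that epoch's segments.
        return [seg for seg in self._segments
                if leader_epoch is None or seg[0] == leader_epoch]


class RemoteSizeTracker:
    """Keeps a per-epoch size so that refreshing an epoch never double counts."""

    def __init__(self, rlmm):
        self._rlmm = rlmm
        self._size_by_epoch = defaultdict(int)
        # One full scan on startup.
        for epoch, size in rlmm.list_remote_log_segments():
            self._size_by_epoch[epoch] += size

    def on_segment_uploaded(self, leader_epoch, size_in_bytes):
        # The leader maintains the size incrementally after the initial scan.
        self._size_by_epoch[leader_epoch] += size_in_bytes

    def refresh_epochs(self, leader_epochs):
        # On leader change: re-list only the epochs that may be stale.
        for epoch in leader_epochs:
            self._size_by_epoch[epoch] = sum(
                size for _, size in self._rlmm.list_remote_log_segments(epoch))

    @property
    def total_size(self):
        return sum(self._size_by_epoch.values())


rlmm = InMemoryRlmm()
rlmm.add_segment(0, 100)
rlmm.add_segment(0, 200)
leader = RemoteSizeTracker(rlmm)    # startup scan sees 300 bytes
follower = RemoteSizeTracker(rlmm)  # startup scan sees 300 bytes

rlmm.add_segment(0, 50)             # leader archives a segment in epoch 0
leader.on_segment_uploaded(0, 50)

follower.refresh_epochs([0])        # leadership moves: catch up on epoch 0
rlmm.add_segment(1, 70)             # new leader archives in epoch 1
follower.on_segment_uploaded(1, 70)
```

The sketch still pays for a full listing at startup, which is the cost debated in this thread; it only shows how the per-epoch listing could limit the work done on a leader change.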
> > > > > > > >> > >
> > > > > > > >> > > Thanks,
> > > > > > > >> > >
> > > > > > > >> > > Jun
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > On Wed, Dec 14, 2022 at 3:27 AM Divij Vaidya <
> > > > > > > divijvaidya13@gmail.com
> > > > > > > >> >
> > > > > > > >> > > wrote:
> > > > > > > >> > >
> > > > > > > >> > > > > Is the new method enough for doing size-based
> > retention?
> > > > > > > >> > > >
> > > > > > > >> > > > Yes. You are right in assuming that this API only
> > > provides the
> > > > > > > >> Remote
> > > > > > > >> > > > storage size (for current epoch chain). We would use
> > this
> > > API
> > > > > > for
> > > > > > > >> size
> > > > > > > >> > > > based retention along with a value of
> > > localOnlyLogSegmentSize
> > > > > > > which
> > > > > > > >> is
> > > > > > > >> > > > computed as
> > > Log.sizeInBytes(logSegments.filter(_.baseOffset >
> > > > > > > >> > > > highestOffsetWithRemoteIndex)). Hence, (total_log_size =
> > > > > > > >> > > > remoteLogSizeBytes + log.localOnlyLogSegmentSize). I
> > have
> > > > > > updated
> > > > > > > >> the
> > > > > > > >> > KIP
> > > > > > > >> > > > with this information. You can also check an example
> > > > > > > implementation
> > > > > > > >> at
> > > > > > > >> > > >
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > >
> > https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/Log.scala#L2077
> > > > > > > >> > > >
> > > > > > > >> > > >
> > > > > > > >> > > > > Do you imagine all accesses to remote metadata will be
> > > > > across
> > > > > > > the
> > > > > > > >> > > network
> > > > > > > >> > > > or will there be some local in-memory cache?
> > > > > > > >> > > >
> > > > > > > >> > > > I would expect a disk-less implementation to maintain a
> > > finite
> > > > > > > >> > in-memory
> > > > > > > >> > > > cache for segment metadata to optimize the number of
> > > network
> > > > > > calls
> > > > > > > >> made
> > > > > > > >> > > to
> > > > > > > >> > > > fetch the data. In future, we can think about bringing
> > > this
> > > > > > finite
> > > > > > > >> size
> > > > > > > >> > > > cache into RLM itself but that's probably a conversation
> > > for a
> > > > > > > >> > different
> > > > > > > >> > > > KIP. There are many other things we would like to do to
> > > > > optimize
> > > > > > > the
> > > > > > > >> > > Tiered
> > > > > > > >> > > > storage interface such as introducing a circular buffer
> > /
> > > > > > > streaming
> > > > > > > >> > > > interface from RSM (so that we don't have to wait to
> > > fetch the
> > > > > > > >> entire
> > > > > > > >> > > > segment before starting to send records to the
> > consumer),
> > > > > > caching
> > > > > > > >> the
> > > > > > > >> > > > segments fetched from RSM locally (I would assume all
> > RSM
> > > > > plugin
> > > > > > > >> > > > implementations to do this, might as well add it to RLM)
> > > etc.
> > > > > > > >> > > >
> > > > > > > >> > > > --
> > > > > > > >> > > > Divij Vaidya
> > > > > > > >> > > >
> > > > > > > >> > > >
> > > > > > > >> > > >
> > > > > > > >> > > > On Mon, Dec 12, 2022 at 7:35 PM Jun Rao
> > > > > > <jun@confluent.io.invalid
> > > > > > > >
> > > > > > > >> > > wrote:
> > > > > > > >> > > >
> > > > > > > >> > > > > Hi, Divij,
> > > > > > > >> > > > >
> > > > > > > >> > > > > Thanks for the reply.
> > > > > > > >> > > > >
> > > > > > > >> > > > > Is the new method enough for doing size-based
> > > retention? It
> > > > > > > gives
> > > > > > > >> the
> > > > > > > >> > > > total
> > > > > > > >> > > > > size of the remote segments, but it seems that we
> > still
> > > > > don't
> > > > > > > know
> > > > > > > >> > the
> > > > > > > >> > > > > exact total size for a log since there could be
> > > overlapping
> > > > > > > >> segments
> > > > > > > >> > > > > between the remote and the local segments.
> > > > > > > >> > > > >
> > > > > > > >> > > > > You mentioned a disk-less implementation. Do you
> > > imagine all
> > > > > > > >> accesses
> > > > > > > >> > > to
> > > > > > > >> > > > > remote metadata will be across the network or will
> > > there be
> > > > > > some
> > > > > > > >> > local
> > > > > > > >> > > > > in-memory cache?
> > > > > > > >> > > > >
> > > > > > > >> > > > > Thanks,
> > > > > > > >> > > > >
> > > > > > > >> > > > > Jun
> > > > > > > >> > > > >
> > > > > > > >> > > > >
> > > > > > > >> > > > >
> > > > > > > >> > > > > On Wed, Dec 7, 2022 at 3:10 AM Divij Vaidya <
> > > > > > > >> divijvaidya13@gmail.com
> > > > > > > >> > >
> > > > > > > >> > > > > wrote:
> > > > > > > >> > > > >
> > > > > > > >> > > > > > The method is needed for RLMM implementations which
> > > fetch
> > > > > > the
> > > > > > > >> > > > information
> > > > > > > >> > > > > > over the network and not for the disk based
> > > > > implementations
> > > > > > > >> (such
> > > > > > > >> > as
> > > > > > > >> > > > the
> > > > > > > >> > > > > > default topic based RLMM).
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > I would argue that adding this API makes the
> > interface
> > > > > more
> > > > > > > >> generic
> > > > > > > >> > > > than
> > > > > > > >> > > > > > what it is today. This is because, with the current
> > > APIs
> > > > > an
> > > > > > > >> > > implementor
> > > > > > > >> > > > > is
> > > > > > > >> > > > > > restricted to use disk based RLMM solutions only
> > > (i.e. the
> > > > > > > >> default
> > > > > > > >> > > > > > solution) whereas if we add this new API, we unblock
> > > usage
> > > > > > of
> > > > > > > >> > network
> > > > > > > >> > > > > based
> > > > > > > >> > > > > > RLMM implementations such as databases.
> > > > > > > >> > > > > >
> > > > > > > >> > > > > >
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > On Wed 30. Nov 2022 at 20:40, Jun Rao
> > > > > > > <jun@confluent.io.invalid
> > > > > > > >> >
> > > > > > > >> > > > wrote:
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > > Hi, Divij,
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > Thanks for the reply.
> > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > Point#2. My high-level question is: is the new
> > > > > method
> > > > > > > >> needed
> > > > > > > >> > > for
> > > > > > > >> > > > > > every
> > > > > > > >> > > > > > > implementation of remote storage or just for a
> > > specific
> > > > > > > >> > > > implementation.
> > > > > > > >> > > > > > The
> > > > > > > >> > > > > > > issues that you pointed out exist for the default
> > > > > > > >> implementation
> > > > > > > >> > of
> > > > > > > >> > > > > RLMM
> > > > > > > >> > > > > > as
> > > > > > > >> > > > > > > well and so far, the default implementation hasn't
> > > > > found a
> > > > > > > >> need
> > > > > > > >> > > for a
> > > > > > > >> > > > > > > similar new method. For public interface, ideally
> > we
> > > > > want
> > > > > > to
> > > > > > > >> make
> > > > > > > >> > > it
> > > > > > > >> > > > > more
> > > > > > > >> > > > > > > general.
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > Thanks,
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > Jun
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > On Mon, Nov 21, 2022 at 7:11 AM Divij Vaidya <
> > > > > > > >> > > > divijvaidya13@gmail.com>
> > > > > > > >> > > > > > > wrote:
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > > Thank you Jun and Alex for your comments.
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > Point#1: You are right Jun. As Alex mentioned,
> > the
> > > > > > > "derived
> > > > > > > >> > > > metadata"
> > > > > > > >> > > > > > can
> > > > > > > >> > > > > > > > increase the size of cached metadata by a factor
> > > of 10
> > > > > > but
> > > > > > > >> it
> > > > > > > >> > > > should
> > > > > > > >> > > > > be
> > > > > > > >> > > > > > > ok
> > > > > > > >> > > > > > > > to cache just the actual metadata. My point
> > about
> > > size
> > > > > > > >> being a
> > > > > > > >> > > > > > limitation
> > > > > > > >> > > > > > > > for using cache is not valid anymore.
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > Point#2: For a new replica, it would still have
> > to
> > > > > fetch
> > > > > > > the
> > > > > > > >> > > > metadata
> > > > > > > >> > > > > > > over
> > > > > > > >> > > > > > > > the network to initiate the warm up of the cache
> > > and
> > > > > > > hence,
> > > > > > > >> > > > increase
> > > > > > > >> > > > > > the
> > > > > > > >> > > > > > > > start time of the archival process. Please also
> > > note
> > > > > the
> > > > > > > >> > > > > repercussions
> > > > > > > >> > > > > > of
> > > > > > > >> > > > > > > > the warm up scan that Alex mentioned in this
> > > thread as
> > > > > > > part
> > > > > > > >> of
> > > > > > > >> > > > > #102.2.
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > 100#: Agreed Alex. Thanks for clarifying that.
> > My
> > > > > point
> > > > > > > >> about
> > > > > > > >> > > size
> > > > > > > >> > > > > > being
> > > > > > > >> > > > > > > a
> > > > > > > >> > > > > > > > limitation for using cache is not valid anymore.
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > 101#: Alex, if I understand correctly, you are
> > > > > > suggesting
> > > > > > > to
> > > > > > > >> > > cache
> > > > > > > >> > > > > the
> > > > > > > >> > > > > > > > total size at the leader and update it on
> > > archival.
> > > > > This
> > > > > > > >> > wouldn't
> > > > > > > >> > > > > work
> > > > > > > >> > > > > > > for
> > > > > > > >> > > > > > > > cases when the leader restarts where we would
> > > have to
> > > > > > > make a
> > > > > > > >> > full
> > > > > > > >> > > > > scan
> > > > > > > >> > > > > > > > to update the total size entry on startup. We
> > > expect
> > > > > > users
> > > > > > > >> to
> > > > > > > >> > > store
> > > > > > > >> > > > > > data
> > > > > > > >> > > > > > > > over longer duration in remote storage which
> > > increases
> > > > > > the
> > > > > > > >> > > > likelihood
> > > > > > > >> > > > > > of
> > > > > > > >> > > > > > > > leader restarts / failovers.
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > 102#.1: I don't think that the current design
> > > > > > accommodates
> > > > > > > >> the
> > > > > > > >> > > fact
> > > > > > > >> > > > > > that
> > > > > > > >> > > > > > > > data corruption could happen at the RLMM plugin
> > > (we
> > > > > > don't
> > > > > > > >> have
> > > > > > > >> > > > > checksum
> > > > > > > >> > > > > > > as
> > > > > > > >> > > > > > > > a field in metadata as part of KIP405). If data
> > > > > > corruption
> > > > > > > >> > > occurs,
> > > > > > > >> > > > w/
> > > > > > > >> > > > > > or
> > > > > > > >> > > > > > > > w/o the cache, it would be a different problem
> > to
> > > > > > solve. I
> > > > > > > >> > would
> > > > > > > >> > > > like
> > > > > > > >> > > > > > to
> > > > > > > >> > > > > > > > keep this outside the scope of this KIP.
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > 102#.2: Agree. This remains as the main concern
> > > for
> > > > > > using
> > > > > > > >> the
> > > > > > > >> > > cache
> > > > > > > >> > > > > to
> > > > > > > >> > > > > > > > fetch total size.
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > Regards,
> > > > > > > >> > > > > > > > Divij Vaidya
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > On Fri, Nov 18, 2022 at 12:59 PM Alexandre
> > > Dupriez <
> > > > > > > >> > > > > > > > alexandre.dupriez@gmail.com> wrote:
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > > Hi Divij,
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > Thanks for the KIP. Please find some comments
> > > based
> > > > > on
> > > > > > > >> what I
> > > > > > > >> > > > read
> > > > > > > >> > > > > on
> > > > > > > >> > > > > > > > > this thread so far - apologies for the repeats
> > > and
> > > > > the
> > > > > > > >> late
> > > > > > > >> > > > reply.
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > If I understand correctly, one of the main
> > > elements
> > > > > of
> > > > > > > >> > > discussion
> > > > > > > >> > > > > is
> > > > > > > >> > > > > > > > > about caching in Kafka versus delegation of
> > > > > providing
> > > > > > > the
> > > > > > > >> > > remote
> > > > > > > >> > > > > size
> > > > > > > >> > > > > > > > > of a topic-partition to the plugin.
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > A few comments:
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > 100. The size of the “derived metadata” which
> > is
> > > > > > managed
> > > > > > > >> by
> > > > > > > >> > the
> > > > > > > >> > > > > > plugin
> > > > > > > >> > > > > > > > > to represent an rlmMetadata can indeed be
> > close
> > > to 1
> > > > > > kB
> > > > > > > on
> > > > > > > >> > > > average
> > > > > > > >> > > > > > > > > depending on its own internal structure, e.g.
> > > the
> > > > > > > >> redundancy
> > > > > > > >> > it
> > > > > > > > >> > > > > > > > > enforces (unfortunately resulting in
> > > duplication),
> > > > > > > >> additional
> > > > > > > >> > > > > > > > > information such as checksums and primary and
> > > > > > secondary
> > > > > > > >> > > indexable
> > > > > > > >> > > > > > > > > keys. But indeed, the rlmMetadata is itself a
> > > > > lighter
> > > > > > > data
> > > > > > > >> > > > > structure
> > > > > > > >> > > > > > > > > by a factor of 10. And indeed, instead of
> > > caching
> > > > > the
> > > > > > > >> > “derived
> > > > > > > >> > > > > > > > > metadata”, only the rlmMetadata could be,
> > which
> > > > > should
> > > > > > > >> > address
> > > > > > > >> > > > the
> > > > > > > >> > > > > > > > > concern regarding the memory occupancy of the
> > > cache.
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > 101. I am not sure I fully understand why we
> > > would
> > > > > > need
> > > > > > > to
> > > > > > > >> > > cache
> > > > > > > >> > > > > the
> > > > > > > >> > > > > > > > > list of rlmMetadata to retain the remote size
> > > of a
> > > > > > > >> > > > topic-partition.
> > > > > > > >> > > > > > > > > Since the leader of a topic-partition is, in
> > > > > > > >> non-degenerated
> > > > > > > >> > > > cases,
> > > > > > > >> > > > > > > > > the only actor which can mutate the remote
> > part
> > > of
> > > > > the
> > > > > > > >> > > > > > > > > topic-partition, hence its size, it could in
> > > theory
> > > > > > only
> > > > > > > >> > cache
> > > > > > > >> > > > the
> > > > > > > >> > > > > > > > > size of the remote log once it has calculated
> > > it? In
> > > > > > > which
> > > > > > > >> > case
> > > > > > > >> > > > > there
> > > > > > > >> > > > > > > > > would not be any problem regarding the size of
> > > the
> > > > > > > caching
> > > > > > > >> > > > > strategy.
> > > > > > > >> > > > > > > > > Did I miss something there?
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > 102. There may be a few challenges to consider
> > > with
> > > > > > > >> caching:
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > 102.1) As mentioned above, the caching
> > strategy
> > > > > > assumes
> > > > > > > no
> > > > > > > >> > > > mutation
> > > > > > > >> > > > > > > > > outside the lifetime of a leader. While this
> > is
> > > true
> > > > > > in
> > > > > > > >> the
> > > > > > > >> > > > normal
> > > > > > > >> > > > > > > > > course of operation, there could be accidental
> > > > > > mutation
> > > > > > > >> > outside
> > > > > > > >> > > > of
> > > > > > > >> > > > > > the
> > > > > > > >> > > > > > > > > leader and a loss of consistency between the
> > > cached
> > > > > > > state
> > > > > > > >> and
> > > > > > > >> > > the
> > > > > > > >> > > > > > > > > actual remote representation of the log. E.g.
> > > > > > > split-brain
> > > > > > > >> > > > > scenarios,
> > > > > > > >> > > > > > > > > bugs in the plugins, bugs in external systems
> > > with
> > > > > > > >> mutating
> > > > > > > >> > > > access
> > > > > > > >> > > > > on
> > > > > > > >> > > > > > > > > the derived metadata. In the worst case, a
> > drift
> > > > > > between
> > > > > > > >> the
> > > > > > > >> > > > cached
> > > > > > > >> > > > > > > > > size and the actual size could lead to
> > > over-deleting
> > > > > > > >> remote
> > > > > > > >> > > data
> > > > > > > >> > > > > > which
> > > > > > > >> > > > > > > > > is a durability risk.
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > The alternative you propose, by making the
> > > plugin
> > > > > the
> > > > > > > >> source
> > > > > > > >> > of
> > > > > > > >> > > > > truth
> > > > > > > > >> > > > > > > > > w.r.t. the size of the remote log, can make
> > > it
> > > > > > easier
> > > > > > > >> to
> > > > > > > >> > > avoid
> > > > > > > >> > > > > > > > > inconsistencies between plugin-managed
> > metadata
> > > and
> > > > > > the
> > > > > > > >> > remote
> > > > > > > >> > > > log
> > > > > > > >> > > > > > > > > from the perspective of Kafka. On the other
> > > hand,
> > > > > > plugin
> > > > > > > >> > > vendors
> > > > > > > >> > > > > > would
> > > > > > > >> > > > > > > > > have to implement it with the expected
> > > efficiency to
> > > > > > > have
> > > > > > > >> it
> > > > > > > >> > > > yield
> > > > > > > >> > > > > > > > > benefits.
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > 102.2) As you mentioned, the caching strategy
> > in
> > > > > Kafka
> > > > > > > >> would
> > > > > > > >> > > > still
> > > > > > > >> > > > > > > > > require one iteration over the list of
> > > rlmMetadata
> > > > > > when
> > > > > > > >> the
> > > > > > > >> > > > > > leadership
> > > > > > > >> > > > > > > > > of a topic-partition is assigned to a broker,
> > > while
> > > > > > the
> > > > > > > >> > plugin
> > > > > > > >> > > > can
> > > > > > > >> > > > > > > > > offer alternative constant-time approaches.
> > This
> > > > > > > >> calculation
> > > > > > > >> > > > cannot
> > > > > > > >> > > > > > be
> > > > > > > >> > > > > > > > > put on the LeaderAndIsr path and would be
> > > performed
> > > > > in
> > > > > > > the
> > > > > > > >> > > > > > background.
> > > > > > > >> > > > > > > > > In case of bulk leadership migration, listing
> > > the
> > > > > > > >> rlmMetadata
> > > > > > > >> > > > could
> > > > > > > >> > > > > > a)
> > > > > > > >> > > > > > > > > result in request bursts to any backend system
> > > the
> > > > > > > plugin
> > > > > > > >> may
> > > > > > > >> > > use
> > > > > > > >> > > > > > > > > [which shouldn’t be a problem for
> > > high-throughput
> > > > > data
> > > > > > > >> stores
> > > > > > > >> > > but
> > > > > > > >> > > > > > > > > could have cost implications] b) increase
> > > > > utilisation
> > > > > > > >> > timespan
> > > > > > > >> > > of
> > > > > > > >> > > > > the
> > > > > > > >> > > > > > > > > RLM threads for these calculations potentially
> > > > > leading
> > > > > > > to
> > > > > > > >> > > > transient
> > > > > > > >> > > > > > > > > starvation of tasks queued for, typically,
> > > > > offloading
> > > > > > > >> > > operations
> > > > > > > >> > > > c)
> > > > > > > >> > > > > > > > > could have a non-marginal CPU footprint on
> > > hardware
> > > > > > with
> > > > > > > >> > strict
> > > > > > > >> > > > > > > > > resource constraints. All these elements could
> > > have
> > > > > an
> > > > > > > >> impact
> > > > > > > >> > > to
> > > > > > > >> > > > > some
> > > > > > > >> > > > > > > > > degree depending on the operational
> > environment.
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > From a design perspective, one question is
> > > where we
> > > > > > want
> > > > > > > >> the
> > > > > > > >> > > > source
> > > > > > > >> > > > > > of
> > > > > > > >> > > > > > > > > truth w.r.t. remote log size to be during the
> > > > > lifetime
> > > > > > > of
> > > > > > > >> a
> > > > > > > >> > > > leader.
> > > > > > > >> > > > > > > > > The responsibility of maintaining a consistent
> > > > > > > >> representation
> > > > > > > >> > > of
> > > > > > > >> > > > > the
> > > > > > > >> > > > > > > > > remote log is shared by Kafka and the plugin.
> > > Which
> > > > > > > >> system is
> > > > > > > >> > > > best
> > > > > > > >> > > > > > > > > placed to maintain such a state while
> > providing
> > > the
> > > > > > > >> highest
> > > > > > > >> > > > > > > > > consistency guarantees is something both Kafka
> > > and
> > > > > > > plugin
> > > > > > > >> > > > designers
> > > > > > > >> > > > > > > > > could help understand better.
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > Many thanks,
> > > > > > > >> > > > > > > > > Alexandre
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > Le jeu. 17 nov. 2022 à 19:27, Jun Rao
> > > > > > > >> > <jun@confluent.io.invalid
> > > > > > > >> > > >
> > > > > > > >> > > > a
> > > > > > > >> > > > > > > > écrit :
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > Hi, Divij,
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > Thanks for the reply.
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > Point #1. Is the average remote segment
> > > metadata
> > > > > > > really
> > > > > > > >> > 1KB?
> > > > > > > >> > > > > What's
> > > > > > > >> > > > > > > > > listed
> > > > > > > >> > > > > > > > > > in the public interface is probably well
> > > below 100
> > > > > > > >> bytes.
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > Point #2. I guess you are assuming that each
> > > > > broker
> > > > > > > only
> > > > > > > >> > > caches
> > > > > > > >> > > > > the
> > > > > > > >> > > > > > > > > remote
> > > > > > > >> > > > > > > > > > segment metadata in memory. An alternative
> > > > > approach
> > > > > > is
> > > > > > > >> to
> > > > > > > >> > > cache
> > > > > > > >> > > > > > them
> > > > > > > >> > > > > > > in
> > > > > > > >> > > > > > > > > > both memory and local disk. That way, on
> > > broker
> > > > > > > restart,
> > > > > > > >> > you
> > > > > > > >> > > > just
> > > > > > > >> > > > > > > need
> > > > > > > >> > > > > > > > to
> > > > > > > >> > > > > > > > > > fetch the new remote segments' metadata
> > using
> > > the
> > > > > > > >> > > > > > > > > > listRemoteLogSegments(TopicIdPartition
> > > > > > > topicIdPartition,
> > > > > > > >> > int
> > > > > > > >> > > > > > > > leaderEpoch)
> > > > > > > >> > > > > > > > > > api. Will that work?
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > Point #3. Thanks for the explanation and it
> > > sounds
> > > > > > > good.
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > Thanks,
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > Jun
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > On Thu, Nov 17, 2022 at 7:31 AM Divij
> > Vaidya <
> > > > > > > >> > > > > > > divijvaidya13@gmail.com>
> > > > > > > >> > > > > > > > > > wrote:
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > > Hi Jun
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > There are three points that I would like
> > to
> > > > > > present
> > > > > > > >> here:
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > 1. We would require a large cache size to
> > > > > > > efficiently
> > > > > > > >> > cache
> > > > > > > >> > > > all
> > > > > > > >> > > > > > > > segment
> > > > > > > >> > > > > > > > > > > metadata.
> > > > > > > >> > > > > > > > > > > 2. Linear scan of all metadata at broker
> > > startup
> > > > > > to
> > > > > > > >> > > populate
> > > > > > > >> > > > > the
> > > > > > > >> > > > > > > > cache
> > > > > > > >> > > > > > > > > will
> > > > > > > >> > > > > > > > > > > be slow and will impact the archival
> > > process.
> > > > > > > >> > > > > > > > > > > 3. There is no other use case where a full
> > > scan
> > > > > of
> > > > > > > >> > segment
> > > > > > > >> > > > > > metadata
> > > > > > > >> > > > > > > > is
> > > > > > > >> > > > > > > > > > > required.
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > Let's start by quantifying 1. Here's my
> > > estimate
> > > > > > for
> > > > > > > >> the
> > > > > > > >> > > size
> > > > > > > >> > > > > of
> > > > > > > >> > > > > > > the
> > > > > > > >> > > > > > > > > cache.
> > > > > > > >> > > > > > > > > > > Average size of segment metadata = 1KB.
> > This
> > > > > could
> > > > > > > be
> > > > > > > >> > more
> > > > > > > >> > > if
> > > > > > > >> > > > > we
> > > > > > > >> > > > > > > have
> > > > > > > >> > > > > > > > > > > frequent leader failover with a large
> > > number of
> > > > > > > leader
> > > > > > > >> > > epochs
> > > > > > > >> > > > > > being
> > > > > > > >> > > > > > > > > stored
> > > > > > > >> > > > > > > > > > > per segment.
> > > > > > > >> > > > > > > > > > > Segment size = 100MB. Users will prefer to
> > > > > reduce
> > > > > > > the
> > > > > > > >> > > segment
> > > > > > > >> > > > > > size
> > > > > > > >> > > > > > > > > from the
> > > > > > > >> > > > > > > > > > > default value of 1GB to ensure timely
> > > archival
> > > > > of
> > > > > > > data
> > > > > > > >> > > since
> > > > > > > >> > > > > data
> > > > > > > >> > > > > > > > from
> > > > > > > >> > > > > > > > > > > active segment is not archived.
> > > > > > > >> > > > > > > > > > > Cache size = num segments * avg. segment
> > > > > metadata
> > > > > > > >> size =
> > > > > > > >> > > > > > > > > (100TB/100MB)*1KB
> > > > > > > >> > > > > > > > > > > = 1GB.
> > > > > > > >> > > > > > > > > > > While 1GB for cache may not sound like a
> > > large
> > > > > > > number
> > > > > > > >> for
> > > > > > > >> > > > > larger
> > > > > > > >> > > > > > > > > machines,
> > > > > > > >> > > > > > > > > > > it does eat into the memory as an
> > additional
> > > > > cache
> > > > > > > and
> > > > > > > >> > > makes
> > > > > > > >> > > > > use
> > > > > > > >> > > > > > > > cases
> > > > > > > >> > > > > > > > > with
> > > > > > > > >> > > > > > > > > > > large data retention with low throughput
> > > > > expensive
> > > > > > > >> (where
> > > > > > > >> > > > such
> > > > > > > >> > > > > > use
> > > > > > > >> > > > > > > > case
> > > > > > > > >> > > > > > > > > > > could use smaller machines).
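
The estimate above can be reproduced with a few lines of arithmetic. The sketch assumes decimal units (1KB = 1000 bytes) and treats the 100TB figure as the total remote data covered by the cache; metadata_cache_bytes is a hypothetical helper name.

```python
KB = 1000
MB = 1000 ** 2
GB = 1000 ** 3
TB = 1000 ** 4


def metadata_cache_bytes(remote_bytes, segment_bytes, metadata_bytes_per_segment):
    """Return (number of segments, cache size in bytes) for the given inputs."""
    num_segments = -(-remote_bytes // segment_bytes)  # ceiling division
    return num_segments, num_segments * metadata_bytes_per_segment


# Figures from the message: 100TB of remote data, 100MB segments, 1KB metadata.
segments, cache = metadata_cache_bytes(100 * TB, 100 * MB, 1 * KB)

# Same data with the default 1GB segment size, for comparison.
segments_default, cache_default = metadata_cache_bytes(100 * TB, 1 * GB, 1 * KB)
```

With these inputs the first call gives one million segments and a 1GB cache, matching the message; with 1GB segments the same data needs 100,000 entries and about 100MB.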
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > About point#2:
> > > > > > > >> > > > > > > > > > > Even if we say that all segment metadata
> > > can fit
> > > > > > > into
> > > > > > > >> the
> > > > > > > >> > > > > cache,
> > > > > > > >> > > > > > we
> > > > > > > >> > > > > > > > > will
> > > > > > > >> > > > > > > > > > > need to populate the cache on broker
> > > startup. It
> > > > > > > would
> > > > > > > >> > not
> > > > > > > >> > > be
> > > > > > > >> > > > > in
> > > > > > > >> > > > > > > the
> > > > > > > > >> > > > > > > > > > > critical path of broker startup and hence
> > > won't
> > > > > > > >> impact
> > > > > > > >> > the
> > > > > > > >> > > > > > startup
> > > > > > > >> > > > > > > > > time.
> > > > > > > >> > > > > > > > > > > But it will impact the time when we could
> > > start
> > > > > > the
> > > > > > > >> > > archival
> > > > > > > >> > > > > > > process
> > > > > > > >> > > > > > > > > since
> > > > > > > >> > > > > > > > > > > the RLM thread pool will be blocked on the
> > > first
> > > > > > > call
> > > > > > > >> to
> > > > > > > >> > > > > > > > > > > listRemoteLogSegments(). To scan metadata
> > > for
> > > > > 1MM
> > > > > > > >> > segments
> > > > > > > >> > > > > > > (computed
> > > > > > > >> > > > > > > > > above)
> > > > > > > >> > > > > > > > > > > and transfer 1GB data over the network
> > from
> > > a
> > > > > RLMM
> > > > > > > >> such
> > > > > > > >> > as
> > > > > > > >> > > a
> > > > > > > >> > > > > > remote
> > > > > > > >> > > > > > > > > > > database would be in the order of minutes
> > > > > > (depending
> > > > > > > >> on
> > > > > > > >> > how
> > > > > > > >> > > > > > > efficient
> > > > > > > >> > > > > > > > > the
> > > > > > > >> > > > > > > > > > > scan is with the RLMM implementation).
> > > > > Although, I
> > > > > > > >> would
> > > > > > > >> > > > > concede
> > > > > > > >> > > > > > > that
> > > > > > > >> > > > > > > > > > > having RLM threads blocked for a few
> > > minutes is
> > > > > > > >> perhaps
> > > > > > > >> > OK
> > > > > > > >> > > > but
> > > > > > > >> > > > > if
> > > > > > > >> > > > > > > we
> > > > > > > >> > > > > > > > > > > introduce the new API proposed in the KIP,
> > > we
> > > > > > would
> > > > > > > >> have
> > > > > > > >> > a
> > > > > > > >> > > > > > > > > > > deterministic startup time for RLM. Adding
> > > the
> > > > > API
> > > > > > > >> comes
> > > > > > > >> > > at a
> > > > > > > >> > > > > low
> > > > > > > >> > > > > > > > cost
> > > > > > > >> > > > > > > > > and
> > > > > > > >> > > > > > > > > > > I believe the trade off is worth it.
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > About point#3:
> > > > > > > >> > > > > > > > > > > We can use
> > > > > listRemoteLogSegments(TopicIdPartition
> > > > > > > >> > > > > > topicIdPartition,
> > > > > > > >> > > > > > > > int
> > > > > > > >> > > > > > > > > > > leaderEpoch) to calculate the segments
> > > eligible
> > > > > > for
> > > > > > > >> > > deletion
> > > > > > > >> > > > > > (based
> > > > > > > >> > > > > > > > on
> > > > > > > >> > > > > > > > > size
> > > > > > > >> > > > > > > > > > > retention) where leader epoch(s) belong to
> > > the
> > > > > > > current
> > > > > > > >> > > leader
> > > > > > > >> > > > > > epoch
> > > > > > > >> > > > > > > > > chain.
> > > > > > > >> > > > > > > > > > > I understand that it may lead to segments
> > > > > > belonging
> > > > > > > to
> > > > > > > >> > > other
> > > > > > > >> > > > > > epoch
> > > > > > > >> > > > > > > > > lineage
> > > > > > > >> > > > > > > > > > > not getting deleted and would require a
> > > separate
> > > > > > > >> > mechanism
> > > > > > > >> > > to
> > > > > > > >> > > > > > > delete
> > > > > > > >> > > > > > > > > them.
> > > > > > > >> > > > > > > > > > > The separate mechanism would anyways be
> > > required
> > > > > > to
> > > > > > > >> > delete
> > > > > > > >> > > > > these
> > > > > > > >> > > > > > > > > "leaked"
> > > > > > > >> > > > > > > > > > > segments as there are other cases which
> > > could
> > > > > lead
> > > > > > > to
> > > > > > > >> > leaks
> > > > > > > >> > > > > such
> > > > > > > >> > > > > > as
> > > > > > > >> > > > > > > > > network
> > > > > > > >> > > > > > > > > > > problems with RSM mid way writing through
> > > > > segment
> > > > > > > >> etc.
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > Thank you for the replies so far. They
> > have
> > > made
> > > > > > me
> > > > > > > >> > > re-think
> > > > > > > >> > > > my
> > > > > > > >> > > > > > > > > assumptions
> > > > > > > >> > > > > > > > > > > and this dialogue has been very
> > > constructive for
> > > > > > me.
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > Regards,
> > > > > > > >> > > > > > > > > > > Divij Vaidya
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > On Thu, Nov 10, 2022 at 10:49 PM Jun Rao
> > > > > > > >> > > > > > <jun@confluent.io.invalid
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > > wrote:
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > Hi, Divij,
> > > > > > > >> > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > Thanks for the reply.
> > > > > > > >> > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > It's true that the data in Kafka could
> > be
> > > kept
> > > > > > > >> longer
> > > > > > > >> > > with
> > > > > > > >> > > > > > > KIP-405.
> > > > > > > >> > > > > > > > > How
> > > > > > > >> > > > > > > > > > > > much data do you envision to have per
> > > broker?
> > > > > > For
> > > > > > > >> 100TB
> > > > > > > >> > > > data
> > > > > > > >> > > > > > per
> > > > > > > >> > > > > > > > > broker,
> > > > > > > >> > > > > > > > > > > > with 1GB segment and segment metadata of
> > > 100
> > > > > > > bytes,
> > > > > > > >> it
> > > > > > > >> > > > > requires
> > > > > > > >> > > > > > > > > > > > 100TB/1GB*100 = 10MB, which should fit
> > in
> > > > > > memory.
> > > > > > > >> > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > RemoteLogMetadataManager has two
> > > > > > > >> > listRemoteLogSegments()
> > > > > > > >> > > > > > methods.
> > > > > > > >> > > > > > > > > The one
> > > > > > > >> > > > > > > > > > > > you listed
> > > > > > listRemoteLogSegments(TopicIdPartition
> > > > > > > >> > > > > > > topicIdPartition,
> > > > > > > >> > > > > > > > > int
> > > > > > > >> > > > > > > > > > > > leaderEpoch) does return data in offset
> > > order.
> > > > > > > >> However,
> > > > > > > >> > > the
> > > > > > > >> > > > > > other
> > > > > > > >> > > > > > > > > > > > one
> > listRemoteLogSegments(TopicIdPartition
> > > > > > > >> > > > topicIdPartition)
> > > > > > > >> > > > > > > > doesn't
> > > > > > > >> > > > > > > > > > > > specify the return order. I assume that
> > > you
> > > > > need
> > > > > > > the
> > > > > > > >> > > latter
> > > > > > > >> > > > > to
> > > > > > > >> > > > > > > > > calculate
> > > > > > > >> > > > > > > > > > > > the segment size?
> > > > > > > >> > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > Thanks,
> > > > > > > >> > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > Jun
> > > > > > > >> > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > On Thu, Nov 10, 2022 at 10:25 AM Divij
> > > Vaidya
> > > > > <
> > > > > > > >> > > > > > > > > divijvaidya13@gmail.com>
> > > > > > > >> > > > > > > > > > > > wrote:
> > > > > > > >> > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > *Jun,*
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > *"the default implementation of RLMM
> > > does
> > > > > > local
> > > > > > > >> > > caching,
> > > > > > > >> > > > > > > right?"*
> > > > > > > >> > > > > > > > > > > > > Yes, Jun. The default implementation
> > of
> > > RLMM
> > > > > > > does
> > > > > > > >> > > indeed
> > > > > > > >> > > > > > cache
> > > > > > > >> > > > > > > > the
> > > > > > > >> > > > > > > > > > > > segment
> > > > > > > >> > > > > > > > > > > > > metadata today, hence, it won't work
> > > for use
> > > > > > > cases
> > > > > > > >> > when
> > > > > > > >> > > > the
> > > > > > > >> > > > > > > > number
> > > > > > > >> > > > > > > > > of
> > > > > > > >> > > > > > > > > > > > > segments in remote storage is large
> > > enough
> > > > > to
> > > > > > > >> exceed
> > > > > > > >> > > the
> > > > > > > >> > > > > size
> > > > > > > >> > > > > > > of
> > > > > > > >> > > > > > > > > cache.
> > > > > > > >> > > > > > > > > > > > As
> > > > > > > >> > > > > > > > > > > > > part of this KIP, I will implement the
> > > new
> > > > > > > >> proposed
> > > > > > > >> > API
> > > > > > > >> > > > in
> > > > > > > >> > > > > > the
> > > > > > > >> > > > > > > > > default
> > > > > > > >> > > > > > > > > > > > > implementation of RLMM but the
> > > underlying
> > > > > > > >> > > implementation
> > > > > > > >> > > > > will
> > > > > > > >> > > > > > > > > still be
> > > > > > > >> > > > > > > > > > > a
> > > > > > > >> > > > > > > > > > > > > scan. I will pick up optimizing that
> > in
> > > a
> > > > > > > separate
> > > > > > > >> > PR.
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > *"we also cache all segment metadata
> > in
> > > the
> > > > > > > >> brokers
> > > > > > > >> > > > without
> > > > > > > >> > > > > > > > > KIP-405. Do
> > > > > > > >> > > > > > > > > > > > you
> > > > > > > >> > > > > > > > > > > > > see a need to change that?"*
> > > > > > > >> > > > > > > > > > > > > Please correct me if I am wrong here
> > > but we
> > > > > > > cache
> > > > > > > >> > > > metadata
> > > > > > > >> > > > > > for
> > > > > > > >> > > > > > > > > segments
> > > > > > > >> > > > > > > > > > > > > "residing in local storage". The size
> > > of the
> > > > > > > >> current
> > > > > > > >> > > > cache
> > > > > > > >> > > > > > > works
> > > > > > > >> > > > > > > > > fine
> > > > > > > >> > > > > > > > > > > for
> > > > > > > >> > > > > > > > > > > > > the scale of the number of segments
> > > that we
> > > > > > > >> expect to
> > > > > > > >> > > > store
> > > > > > > >> > > > > > in
> > > > > > > >> > > > > > > > > local
> > > > > > > >> > > > > > > > > > > > > storage. After KIP-405, that cache
> > will
> > > > > > continue
> > > > > > > >> to
> > > > > > > >> > > store
> > > > > > > >> > > > > > > > metadata
> > > > > > > >> > > > > > > > > for
> > > > > > > >> > > > > > > > > > > > > segments which are residing in local
> > > storage
> > > > > > and
> > > > > > > >> > hence,
> > > > > > > >> > > > we
> > > > > > > >> > > > > > > don't
> > > > > > > >> > > > > > > > > need
> > > > > > > >> > > > > > > > > > > to
> > > > > > > >> > > > > > > > > > > > > change that. For segments which have
> > > been
> > > > > > > >> offloaded
> > > > > > > >> > to
> > > > > > > >> > > > > remote
> > > > > > > >> > > > > > > > > storage,
> > > > > > > >> > > > > > > > > > > it
> > > > > > > >> > > > > > > > > > > > > would rely on RLMM. Note that the
> > scale
> > > of
> > > > > > data
> > > > > > > >> > stored
> > > > > > > >> > > in
> > > > > > > >> > > > > > RLMM
> > > > > > > >> > > > > > > is
> > > > > > > >> > > > > > > > > > > > different
> > > > > > > >> > > > > > > > > > > > > from local cache because the number of
> > > > > > segments
> > > > > > > is
> > > > > > > >> > > > expected
> > > > > > > >> > > > > > to
> > > > > > > >> > > > > > > be
> > > > > > > >> > > > > > > > > much
> > > > > > > >> > > > > > > > > > > > > larger than what current
> > implementation
> > > > > stores
> > > > > > > in
> > > > > > > >> > local
> > > > > > > >> > > > > > > storage.
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > 2,3,4:
> > > > > > > >> > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > >> > > > > does
> > > > > > > >> > > > > > > > > specify
> > > > > > > >> > > > > > > > > > > the
> > > > > > > >> > > > > > > > > > > > > order i.e. it returns the segments
> > > sorted by
> > > > > > > first
> > > > > > > >> > > offset
> > > > > > > >> > > > > in
> > > > > > > >> > > > > > > > > ascending
> > > > > > > >> > > > > > > > > > > > > order. I am copying the API docs for
> > > KIP-405
> > > > > > > here
> > > > > > > >> for
> > > > > > > >> > > > your
> > > > > > > >> > > > > > > > > reference
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > *Returns iterator of remote log
> > segment
> > > > > > > metadata,
> > > > > > > >> > > sorted
> > > > > > > >> > > > by
> > > > > > > >> > > > > > > > {@link
> > > > > > > >> > > > > > > > > > > > >
> > RemoteLogSegmentMetadata#startOffset()}
> > > > > > > >> in ascending
> > > > > > > >> > > order
> > > > > > > >> > > > > > which
> > > > > > > >> > > > > > > > > > > contains
> > > > > > > >> > > > > > > > > > > > > the given leader epoch. This is used
> > by
> > > > > remote
> > > > > > > log
> > > > > > > >> > > > > retention
> > > > > > > >> > > > > > > > > management
> > > > > > > >> > > > > > > > > > > > > subsystem to fetch the segment metadata
> > > for a
> > > > > > > given
> > > > > > > >> > > leader
> > > > > > > >> > > > > > > > > epoch. @param
> > > > > > > >> > > > > > > > > > > > > topicIdPartition topic partition @param
> > > > > > > >> leaderEpoch
> > > > > > > >> > > > > > leader
> > > > > > > >> > > > > > > > > > > > > epoch @return
> > > > > > > >> > > > > > > > > > > > > Iterator of remote segments, sorted by
> > > start
> > > > > > > >> offset
> > > > > > > >> > in
> > > > > > > >> > > > > > > ascending
> > > > > > > >> > > > > > > > > > > order. *
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > *Luke,*
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > 5. Note that we are trying to optimize
> > > the
> > > > > > > >> efficiency
> > > > > > > >> > > of
> > > > > > > >> > > > > size
> > > > > > > >> > > > > > > > based
> > > > > > > >> > > > > > > > > > > > > retention for remote storage. KIP-405
> > > does
> > > > > not
> > > > > > > >> > > introduce
> > > > > > > >> > > > a
> > > > > > > >> > > > > > new
> > > > > > > >> > > > > > > > > config
> > > > > > > >> > > > > > > > > > > for
> > > > > > > >> > > > > > > > > > > > > periodically checking remote similar
> > to
> > > > > > > >> > > > > > > > > > > log.retention.check.interval.ms
> > > > > > > >> > > > > > > > > > > > > which is applicable for remote
> > storage.
> > > > > Hence,
> > > > > > > the
> > > > > > > >> > > metric
> > > > > > > >> > > > > > will
> > > > > > > >> > > > > > > be
> > > > > > > >> > > > > > > > > > > updated
> > > > > > > >> > > > > > > > > > > > > at the time of invoking log retention
> > > check
> > > > > > for
> > > > > > > >> > remote
> > > > > > > >> > > > tier
> > > > > > > >> > > > > > > which
> > > > > > > >> > > > > > > > > is
> > > > > > > >> > > > > > > > > > > > > pending implementation today. We can
> > > perhaps
> > > > > > > come
> > > > > > > >> > back
> > > > > > > >> > > > and
> > > > > > > >> > > > > > > update
> > > > > > > >> > > > > > > > > the
> > > > > > > >> > > > > > > > > > > > > metric description after the
> > > implementation
> > > > > of
> > > > > > > log
> > > > > > > >> > > > > retention
> > > > > > > >> > > > > > > > check
> > > > > > > >> > > > > > > > > in
> > > > > > > >> > > > > > > > > > > > > RemoteLogManager.
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > --
> > > > > > > >> > > > > > > > > > > > > Divij Vaidya
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > On Thu, Nov 10, 2022 at 6:16 AM Luke
> > > Chen <
> > > > > > > >> > > > > showuon@gmail.com
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > > > wrote:
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > Hi Divij,
> > > > > > > >> > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > One more question about the metric:
> > > > > > > >> > > > > > > > > > > > > > I think the metric will be updated
> > > when
> > > > > > > >> > > > > > > > > > > > > > (1) each time we run the log
> > retention
> > > > > check
> > > > > > > >> (that
> > > > > > > >> > > is,
> > > > > > > >> > > > > > > > > > > > > > log.retention.check.interval.ms)
> > > > > > > >> > > > > > > > > > > > > > (2) when the user explicitly calls
> > > > > > getRemoteLogSize
> > > > > > > >> > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > Is that correct?
> > > > > > > >> > > > > > > > > > > > > > Maybe we should add a note in metric
> > > > > > > >> description,
> > > > > > > >> > > > > > otherwise,
> > > > > > > >> > > > > > > > when
> > > > > > > >> > > > > > > > > > > user
> > > > > > > >> > > > > > > > > > > > > got,
> > > > > > > >> > > > > > > > > > > > > > let's say 0 of RemoteLogSizeBytes,
> > > will be
> > > > > > > >> > surprised.
> > > > > > > >> > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > Otherwise, LGTM
> > > > > > > >> > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > Thank you for the KIP
> > > > > > > >> > > > > > > > > > > > > > Luke
> > > > > > > >> > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > On Thu, Nov 10, 2022 at 2:55 AM Jun
> > > Rao
> > > > > > > >> > > > > > > > <jun@confluent.io.invalid
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > wrote:
> > > > > > > >> > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > Hi, Divij,
> > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > Thanks for the explanation.
> > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > 1. Hmm, the default implementation
> > > of
> > > > > RLMM
> > > > > > > >> does
> > > > > > > >> > > local
> > > > > > > >> > > > > > > > caching,
> > > > > > > >> > > > > > > > > > > right?
> > > > > > > >> > > > > > > > > > > > > > > Currently, we also cache all
> > segment
> > > > > > > metadata
> > > > > > > >> in
> > > > > > > >> > > the
> > > > > > > >> > > > > > > brokers
> > > > > > > >> > > > > > > > > > > without
> > > > > > > >> > > > > > > > > > > > > > > KIP-405. Do you see a need to
> > change
> > > > > that?
> > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > 2,3,4: Yes, your explanation makes
> > > > > sense.
> > > > > > > >> > However,
> > > > > > > >> > > > > > > > > > > > > > > currently,
> > > > > > > >> > > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > >> > > > > > > > > doesn't
> > > > > > > >> > > > > > > > > > > > > > specify
> > > > > > > >> > > > > > > > > > > > > > > a particular order of the
> > iterator.
> > > Do
> > > > > you
> > > > > > > >> intend
> > > > > > > >> > > to
> > > > > > > >> > > > > > change
> > > > > > > >> > > > > > > > > that?
> > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > Thanks,
> > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > Jun
> > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 3:31 AM
> > Divij
> > > > > > Vaidya
> > > > > > > <
> > > > > > > >> > > > > > > > > > > divijvaidya13@gmail.com
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > wrote:
> > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > Hey Jun
> > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > Thank you for your comments.
> > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > *1. "RLMM implementor could
> > ensure
> > > > > that
> > > > > > > >> > > > > > > > > listRemoteLogSegments()
> > > > > > > >> > > > > > > > > > > is
> > > > > > > >> > > > > > > > > > > > > > fast"*
> > > > > > > >> > > > > > > > > > > > > > > > This would be ideal but
> > > pragmatically,
> > > > > > it
> > > > > > > is
> > > > > > > >> > > > > difficult
> > > > > > > >> > > > > > to
> > > > > > > >> > > > > > > > > ensure
> > > > > > > >> > > > > > > > > > > > that
> > > > > > > >> > > > > > > > > > > > > > > > listRemoteLogSegments() is fast.
> > > This
> > > > > is
> > > > > > > >> > because
> > > > > > > >> > > of
> > > > > > > >> > > > > the
> > > > > > > >> > > > > > > > > > > possibility
> > > > > > > >> > > > > > > > > > > > > of
> > > > > > > >> > > > > > > > > > > > > > a
> > > > > > > >> > > > > > > > > > > > > > > > large number of segments (much
> > > larger
> > > > > > than
> > > > > > > >> what
> > > > > > > >> > > > Kafka
> > > > > > > >> > > > > > > > > currently
> > > > > > > >> > > > > > > > > > > > > handles
> > > > > > > >> > > > > > > > > > > > > > > > with local storage today) would
> > > make
> > > > > it
> > > > > > > >> > > infeasible
> > > > > > > >> > > > to
> > > > > > > >> > > > > > > adopt
> > > > > > > >> > > > > > > > > > > > > strategies
> > > > > > > >> > > > > > > > > > > > > > > such
> > > > > > > >> > > > > > > > > > > > > > > > as local caching to improve the
> > > > > > > performance
> > > > > > > >> of
> > > > > > > >> > > > > > > > > > > > listRemoteLogSegments.
> > > > > > > >> > > > > > > > > > > > > > > Apart
> > > > > > > >> > > > > > > > > > > > > > > > from caching (which won't work
> > > due to
> > > > > > size
> > > > > > > >> > > > > > limitations) I
> > > > > > > >> > > > > > > > > can't
> > > > > > > >> > > > > > > > > > > > think
> > > > > > > >> > > > > > > > > > > > > > of
> > > > > > > >> > > > > > > > > > > > > > > > other strategies which may
> > > eliminate
> > > > > the
> > > > > > > >> need
> > > > > > > >> > for
> > > > > > > >> > > > IO
> > > > > > > >> > > > > > > > > > > > > > > > operations proportional to the
> > > number
> > > > > of
> > > > > > > >> total
> > > > > > > >> > > > > > segments.
> > > > > > > >> > > > > > > > > Please
> > > > > > > >> > > > > > > > > > > > > advise
> > > > > > > >> > > > > > > > > > > > > > if
> > > > > > > >> > > > > > > > > > > > > > > > you have something in mind.
> > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > 2. "*If the size exceeds the
> > > > > retention
> > > > > > > >> size,
> > > > > > > >> > we
> > > > > > > >> > > > need
> > > > > > > >> > > > > > to
> > > > > > > >> > > > > > > > > > > determine
> > > > > > > >> > > > > > > > > > > > > the
> > > > > > > >> > > > > > > > > > > > > > > > subset of segments to delete to
> > > bring
> > > > > > the
> > > > > > > >> size
> > > > > > > >> > > > within
> > > > > > > >> > > > > > the
> > > > > > > >> > > > > > > > > > > retention
> > > > > > > >> > > > > > > > > > > > > > > limit.
> > > > > > > >> > > > > > > > > > > > > > > > Do we need to call
> > > > > > > >> > > > > > > > > > >
> > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > >> > > > > > > > > > > > > to
> > > > > > > >> > > > > > > > > > > > > > > > determine that?"*
> > > > > > > >> > > > > > > > > > > > > > > > Yes, we need to call
> > > > > > > >> listRemoteLogSegments() to
> > > > > > > >> > > > > > determine
> > > > > > > >> > > > > > > > > which
> > > > > > > >> > > > > > > > > > > > > > segments
> > > > > > > >> > > > > > > > > > > > > > > > should be deleted. But there is
> > a
> > > > > > > difference
> > > > > > > >> > with
> > > > > > > >> > > > the
> > > > > > > >> > > > > > use
> > > > > > > >> > > > > > > > > case we
> > > > > > > >> > > > > > > > > > > > are
> > > > > > > >> > > > > > > > > > > > > > > > trying to optimize with this
> > KIP.
> > > To
> > > > > > > >> determine
> > > > > > > >> > > the
> > > > > > > >> > > > > > subset
> > > > > > > >> > > > > > > > of
> > > > > > > >> > > > > > > > > > > > segments
> > > > > > > >> > > > > > > > > > > > > > > which
> > > > > > > >> > > > > > > > > > > > > > > > would be deleted, we only read
> > > > > metadata
> > > > > > > for
> > > > > > > >> > > > segments
> > > > > > > >> > > > > > > which
> > > > > > > >> > > > > > > > > would
> > > > > > > >> > > > > > > > > > > be
> > > > > > > >> > > > > > > > > > > > > > > deleted
> > > > > > > >> > > > > > > > > > > > > > > > via the listRemoteLogSegments().
> > > But
> > > > > to
> > > > > > > >> > determine
> > > > > > > >> > > > the
> > > > > > > >> > > > > > > > > > > totalLogSize,
> > > > > > > >> > > > > > > > > > > > > > which
> > > > > > > >> > > > > > > > > > > > > > > > is required every time retention
> > > logic
> > > > > > > >> based on
> > > > > > > >> > > > size
> > > > > > > >> > > > > > > > > executes, we
> > > > > > > >> > > > > > > > > > > > > read
> > > > > > > >> > > > > > > > > > > > > > > > metadata of *all* the segments
> > in
> > > > > remote
> > > > > > > >> > storage.
> > > > > > > >> > > > > > Hence,
> > > > > > > >> > > > > > > > the
> > > > > > > >> > > > > > > > > > > number
> > > > > > > >> > > > > > > > > > > > > of
> > > > > > > >> > > > > > > > > > > > > > > > results returned by
> > > > > > > >> > > > > > > > > > > >
> > > > > > *RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > >> > > > > > > > > > > > > > *is
> > > > > > > >> > > > > > > > > > > > > > > > different when we are
> > calculating
> > > > > > > >> totalLogSize
> > > > > > > >> > > vs.
> > > > > > > >> > > > > when
> > > > > > > >> > > > > > > we
> > > > > > > >> > > > > > > > > are
> > > > > > > >> > > > > > > > > > > > > > > determining
> > > > > > > >> > > > > > > > > > > > > > > > the subset of segments to
> > delete.
> > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > 3.
> > > > > > > >> > > > > > > > > > > > > > > > *"Also, what about time-based
> > > > > retention?
> > > > > > > To
> > > > > > > >> > make
> > > > > > > >> > > > that
> > > > > > > >> > > > > > > > > efficient,
> > > > > > > >> > > > > > > > > > > do
> > > > > > > >> > > > > > > > > > > > > we
> > > > > > > >> > > > > > > > > > > > > > > need
> > > > > > > >> > > > > > > > > > > > > > > > to make some additional
> > interface
> > > > > > > >> changes?"* No.
> > > > > > > >> > > > Note
> > > > > > > >> > > > > > that
> > > > > > > >> > > > > > > > > time
> > > > > > > >> > > > > > > > > > > > > > complexity
> > > > > > > >> > > > > > > > > > > > > > > > to determine the segments for
> > > > > retention
> > > > > > is
> > > > > > > >> > > > different
> > > > > > > >> > > > > > for
> > > > > > > >> > > > > > > > time
> > > > > > > >> > > > > > > > > > > based
> > > > > > > >> > > > > > > > > > > > > vs.
> > > > > > > >> > > > > > > > > > > > > > > > size based. For time based, the
> > > time
> > > > > > > >> complexity
> > > > > > > >> > > is
> > > > > > > >> > > > a
> > > > > > > >> > > > > > > > > function of
> > > > > > > >> > > > > > > > > > > > the
> > > > > > > >> > > > > > > > > > > > > > > number
> > > > > > > >> > > > > > > > > > > > > > > > of segments which are "eligible
> > > for
> > > > > > > >> deletion"
> > > > > > > >> > > > (since
> > > > > > > >> > > > > we
> > > > > > > >> > > > > > > > only
> > > > > > > >> > > > > > > > > read
> > > > > > > >> > > > > > > > > > > > > > > metadata
> > > > > > > >> > > > > > > > > > > > > > > > for segments which would be
> > > deleted)
> > > > > > > >> whereas in
> > > > > > > >> > > > size
> > > > > > > >> > > > > > > based
> > > > > > > >> > > > > > > > > > > > retention,
> > > > > > > >> > > > > > > > > > > > > > the
> > > > > > > >> > > > > > > > > > > > > > > > time complexity is a function of
> > > "all
> > > > > > > >> segments"
> > > > > > > >> > > > > > available
> > > > > > > >> > > > > > > > in
> > > > > > > >> > > > > > > > > > > remote
> > > > > > > >> > > > > > > > > > > > > > > storage
> > > > > > > >> > > > > > > > > > > > > > > > (metadata of all segments needs
> > > to be
> > > > > > read
> > > > > > > >> to
> > > > > > > >> > > > > calculate
> > > > > > > >> > > > > > > the
> > > > > > > >> > > > > > > > > total
> > > > > > > >> > > > > > > > > > > > > > size).
> > > > > > > >> > > > > > > > > > > > > > > As
> > > > > > > >> > > > > > > > > > > > > > > > you may observe, this KIP will
> > > bring
> > > > > the
> > > > > > > >> time
> > > > > > > >> > > > > > complexity
> > > > > > > >> > > > > > > > for
> > > > > > > >> > > > > > > > > both
> > > > > > > >> > > > > > > > > > > > > time
> > > > > > > >> > > > > > > > > > > > > > > > based retention & size based
> > > retention
> > > > > > to
> > > > > > > >> the
> > > > > > > >> > > same
> > > > > > > >> > > > > > > > function.
> > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > 4. Also, please note that this
> > > new API
> > > > > > > >> > introduced
> > > > > > > >> > > > in
> > > > > > > >> > > > > > this
> > > > > > > >> > > > > > > > KIP
> > > > > > > >> > > > > > > > > > > also
> > > > > > > >> > > > > > > > > > > > > > > enables
> > > > > > > >> > > > > > > > > > > > > > > > us to provide a metric for total
> > > size
> > > > > of
> > > > > > > >> data
> > > > > > > >> > > > stored
> > > > > > > >> > > > > in
> > > > > > > >> > > > > > > > > remote
> > > > > > > >> > > > > > > > > > > > > storage.
> > > > > > > >> > > > > > > > > > > > > > > > Without the API, calculation of
> > > this
> > > > > > > metric
> > > > > > > >> > will
> > > > > > > >> > > > > become
> > > > > > > >> > > > > > > > very
> > > > > > > >> > > > > > > > > > > > > expensive
> > > > > > > >> > > > > > > > > > > > > > > with
> > > > > > > >> > > > > > > > > > > > > > > > *listRemoteLogSegments().*
> > > > > > > >> > > > > > > > > > > > > > > > I understand that your
> > motivation
> > > here
> > > > > > is
> > > > > > > to
> > > > > > > >> > > avoid
> > > > > > > >> > > > > > > > polluting
> > > > > > > >> > > > > > > > > the
> > > > > > > >> > > > > > > > > > > > > > > interface
> > > > > > > >> > > > > > > > > > > > > > > > with optimization specific APIs
> > > and I
> > > > > > will
> > > > > > > >> > agree
> > > > > > > >> > > > with
> > > > > > > >> > > > > > > that
> > > > > > > >> > > > > > > > > goal.
> > > > > > > >> > > > > > > > > > > > But
> > > > > > > >> > > > > > > > > > > > > I
> > > > > > > >> > > > > > > > > > > > > > > > believe that this new API
> > > proposed in
> > > > > > the
> > > > > > > >> KIP
> > > > > > > >> > > > brings
> > > > > > > >> > > > > in
> > > > > > > >> > > > > > > > > > > significant
> > > > > > > >> > > > > > > > > > > > > > > > improvement and there is no
> > other
> > > work
> > > > > > > >> around
> > > > > > > >> > > > > available
> > > > > > > >> > > > > > > to
> > > > > > > >> > > > > > > > > > > achieve
> > > > > > > >> > > > > > > > > > > > > the
> > > > > > > >> > > > > > > > > > > > > > > same
> > > > > > > >> > > > > > > > > > > > > > > > performance.
> > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > Regards,
> > > > > > > >> > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 12:12 AM
> > > Jun
> > > > > Rao
> > > > > > > >> > > > > > > > > <jun@confluent.io.invalid
> > > > > > > >> > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > wrote:
> > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > Hi, Divij,
> > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > Thanks for the KIP. Sorry for
> > > the
> > > > > late
> > > > > > > >> reply.
> > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > The motivation of the KIP is
> > to
> > > > > > improve
> > > > > > > >> the
> > > > > > > >> > > > > > efficiency
> > > > > > > >> > > > > > > of
> > > > > > > >> > > > > > > > > size
> > > > > > > >> > > > > > > > > > > > > based
> > > > > > > >> > > > > > > > > > > > > > > > > retention. I am not sure the
> > > > > proposed
> > > > > > > >> changes
> > > > > > > >> > > are
> > > > > > > >> > > > > > > enough.
> > > > > > > >> > > > > > > > > For
> > > > > > > >> > > > > > > > > > > > > > example,
> > > > > > > >> > > > > > > > > > > > > > > if
> > > > > > > >> > > > > > > > > > > > > > > > > the size exceeds the retention
> > > size,
> > > > > > we
> > > > > > > >> need
> > > > > > > >> > to
> > > > > > > >> > > > > > > determine
> > > > > > > >> > > > > > > > > the
> > > > > > > >> > > > > > > > > > > > > subset
> > > > > > > >> > > > > > > > > > > > > > of
> > > > > > > >> > > > > > > > > > > > > > > > > segments to delete to bring
> > the
> > > size
> > > > > > > >> within
> > > > > > > >> > the
> > > > > > > >> > > > > > > retention
> > > > > > > >> > > > > > > > > > > limit.
> > > > > > > >> > > > > > > > > > > > Do
> > > > > > > >> > > > > > > > > > > > > > we
> > > > > > > >> > > > > > > > > > > > > > > > need
> > > > > > > >> > > > > > > > > > > > > > > > > to call
> > > > > > > >> > > > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > >> > > > > > > > to
> > > > > > > >> > > > > > > > > > > > > determine
> > > > > > > >> > > > > > > > > > > > > > > > that?
> > > > > > > >> > > > > > > > > > > > > > > > > Also, what about time-based
> > > > > retention?
> > > > > > > To
> > > > > > > >> > make
> > > > > > > >> > > > that
> > > > > > > >> > > > > > > > > efficient,
> > > > > > > >> > > > > > > > > > > do
> > > > > > > >> > > > > > > > > > > > > we
> > > > > > > >> > > > > > > > > > > > > > > need
> > > > > > > >> > > > > > > > > > > > > > > > > to make some additional
> > > interface
> > > > > > > changes?
> > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > An alternative approach is for
> > > the
> > > > > > RLMM
> > > > > > > >> > > > implementor
> > > > > > > >> > > > > > to
> > > > > > > >> > > > > > > > make
> > > > > > > >> > > > > > > > > > > sure
> > > > > > > >> > > > > > > > > > > > > > > > > that
> > > > > > > >> > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > >> > > > > > > is
> > > > > > > >> > > > > > > > > fast
> > > > > > > >> > > > > > > > > > > > > (e.g.,
> > > > > > > >> > > > > > > > > > > > > > > with
> > > > > > > >> > > > > > > > > > > > > > > > > local caching). This way, we
> > > could
> > > > > > keep
> > > > > > > >> the
> > > > > > > >> > > > > interface
> > > > > > > >> > > > > > > > > simple.
> > > > > > > >> > > > > > > > > > > > Have
> > > > > > > >> > > > > > > > > > > > > we
> > > > > > > >> > > > > > > > > > > > > > > > > considered that?
> > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > Thanks,
> > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > Jun
> > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > On Wed, Sep 28, 2022 at 6:28 AM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > >> > > > > > > > > > > > > > > > > > wrote:
> > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > Hey folks
> > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > Does anyone else have any thoughts on this before I propose this for
> > > > > > > >> > > > > > > > > > > > > > > > > > > a vote?
> > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > --
> > > > > > > >> > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > On Mon, Sep 5, 2022 at 12:57 PM Satish Duggana <satish.duggana@gmail.com>
> > > > > > > >> > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > Thanks for the KIP Divij!
> > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > This is a nice improvement to avoid recalculation of size. Customized
> > > > > > > >> > > > > > > > > > > > > > > > > > > > RLMMs can implement the best possible approach by caching or
> > > > > > > >> > > > > > > > > > > > > > > > > > > > maintaining the size in an efficient way. But this is not a big
> > > > > > > >> > > > > > > > > > > > > > > > > > > > concern for the default topic based RLMM as mentioned in the KIP.
> > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > ~Satish.
> > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > On Wed, 13 Jul 2022 at 18:48, Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > >> > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > Thank you for your review Luke.
> > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Reg: would the new `RemoteLogSizeBytes` metric be a performance
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > overhead? Although we move the calculation to a separate API, we
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > still can't assume users will implement a light-weight method,
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > right?
> > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > This metric would be logged using the information that is already
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > being calculated for handling remote retention logic; hence, no
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > additional work is required to calculate this metric. More
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > specifically, whenever RemoteLogManager calls the getRemoteLogSize
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > API, this metric would be captured. This API call is made every time
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > RemoteLogManager wants to handle expired remote log segments (which
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > should be periodic). Does that address your concern?
> > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > On Tue, Jul 12, 2022 at 11:01 AM Luke Chen <showuon@gmail.com>
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Hi Divij,
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Thanks for the KIP!
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > I think it makes sense to delegate the responsibility of calculation
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > to the specific RemoteLogMetadataManager implementation.
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > But one thing I'm not quite sure about is whether the new
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > `RemoteLogSizeBytes` metric would be a performance overhead.
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Although we move the calculation to a separate API, we still can't
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > assume users will implement a light-weight method, right?
> > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > Thank you.
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > Luke
> > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > On Fri, Jul 1, 2022 at 5:47 PM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-852%3A+Optimize+calculation+of+size+for+log+in+remote+tier
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > Hey folks
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > Please take a look at this KIP which proposes an extension to
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > KIP-405. This is my first KIP with Apache Kafka community so any
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > feedback would be highly appreciated.
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Cheers!
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > Sr. Software Engineer
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Amazon
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > >

Re: [DISCUSS] KIP-852 Optimize calculation of size for log in remote tier

Posted by Jorge Esteban Quilcate Otoya <qu...@gmail.com>.
Thanks Divij, this KIP is a super useful improvement to Tiered Storage.

I have a couple of minor comments on the KIP; otherwise I'm +1 on this
proposal:

1. As far as I can see, the TS APIs haven't used the getter naming
convention (e.g. `RLMM#remoteLogSegmentMetadata()`). We could rename the
proposed method to `RemoteLogMetadataManager#remoteLogSize(...)` to keep it
consistent.
2. The proposal for a new metric only includes `topic` as a label. Could we
also include a `partition` label?
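
To make both points concrete, here is a rough Java sketch, not the actual
KIP or Kafka code. It assumes a simplified signature (plain topic name and
partition number instead of Kafka's TopicIdPartition), and every name in it
(RemoteLogSizeLookup, InMemoryRemoteLogSize, segmentAdded, segmentDeleted,
metricTags) is hypothetical. It shows a getter-free `remoteLogSize(...)`
method backed by an incrementally maintained total, plus metric tags that
carry both `topic` and `partition`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical slice of an RLMM-like interface; only the renamed method is shown.
interface RemoteLogSizeLookup {
    // No "get" prefix, in line with remoteLogSegmentMetadata() and similar methods.
    long remoteLogSize(String topic, int partition, int leaderEpoch);
}

// Illustrative implementation that keeps a running total per topic-partition
// instead of listing every segment on each call.
class InMemoryRemoteLogSize implements RemoteLogSizeLookup {
    private final Map<String, AtomicLong> sizes = new ConcurrentHashMap<>();

    private static String key(String topic, int partition) {
        return topic + "-" + partition;
    }

    // Called when a segment has been copied to the remote tier.
    void segmentAdded(String topic, int partition, long bytes) {
        sizes.computeIfAbsent(key(topic, partition), k -> new AtomicLong()).addAndGet(bytes);
    }

    // Called when a segment has been deleted from the remote tier.
    void segmentDeleted(String topic, int partition, long bytes) {
        sizes.computeIfAbsent(key(topic, partition), k -> new AtomicLong()).addAndGet(-bytes);
    }

    @Override
    public long remoteLogSize(String topic, int partition, int leaderEpoch) {
        // Simplification: leaderEpoch is ignored here. A real implementation
        // would presumably restrict the total to the given epoch chain.
        AtomicLong size = sizes.get(key(topic, partition));
        return size == null ? 0L : size.get();
    }

    // Tags for a size metric labelled by both topic and partition.
    static Map<String, String> metricTags(String topic, int partition) {
        return Map.of("topic", topic, "partition", Integer.toString(partition));
    }
}
```

With this shape, the caller reads the size in constant time and the metric
can be reported per partition rather than only per topic.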

Cheers,
Jorge.


On Sat, 1 Jul 2023 at 22:33, Divij Vaidya <di...@gmail.com> wrote:

> Thank you folks for reviewing this KIP.
>
> Satish, I have modified the motivation to make it more clear. Now it says,
> "Since the main feature of tiered storage is storing a large amount of
> data, we expect num_remote_segments to be large. A frequent linear scan
> (i.e. listing all segment metadata) could be expensive/slower because of
> the underlying storage used by RemoteLogMetadataManager. This slowness to
> list all segment metadata could result in the loss of availability...."
>
> Jun, Kamal, Satish, if you don't have any further concerns, I would
> appreciate a vote for this KIP in the voting thread -
> https://lists.apache.org/thread/soz00990gvzodv7oyqj4ysvktrqy6xfk
>
> --
> Divij Vaidya
>
>
>
> On Sat, Jul 1, 2023 at 6:16 AM Kamal Chandraprakash <
> kamal.chandraprakash@gmail.com> wrote:
>
> > Hi Divij,
> >
> > Thanks for the explanation. LGTM.
> >
> > --
> > Kamal
> >
> > On Sat, Jul 1, 2023 at 7:28 AM Satish Duggana <sa...@gmail.com>
> > wrote:
> >
> > > Hi Divij,
> > > I am fine with having an API to compute the size as I mentioned in my
> > > earlier reply in this mail thread. But I have the below comment for
> > > the motivation for this KIP.
> > >
> > > As you discussed offline, the main issue here is listing calls for
> > > remote log segment metadata is slower because of the storage used for
> > > RLMM. These can be avoided with this new API.
> > >
> > > Please add this in the motivation section as it is one of the main
> > > motivations for the KIP.
> > >
> > > Thanks,
> > > Satish.
> > >
> > > On Sat, 1 Jul 2023 at 01:43, Jun Rao <ju...@confluent.io.invalid> wrote:
> > > >
> > > > Hi, Divij,
> > > >
> > > > Sorry for the late reply.
> > > >
> > > > > Given your explanation, the new API sounds reasonable to me. Is that
> > > > > enough to build the external metadata layer for the remote segments
> > > > > or do you need some additional API changes?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > > On Fri, Jun 9, 2023 at 7:08 AM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > wrote:
> > > >
> > > > > Thank you for looking into this Kamal.
> > > > >
> > > > > > You are right in saying that a cold start (i.e. leadership failover
> > > > > > or broker startup) does not impact the broker startup duration. But
> > > > > > it does have the following impact:
> > > > > > 1. It leads to a burst of full-scan requests to RLMM in case multiple
> > > > > > leadership failovers occur at the same time. Even if the RLMM
> > > > > > implementation has the capability to serve the total size from an
> > > > > > index (and hence handle this burst), we wouldn't be able to use it
> > > > > > since the current API necessarily calls for a full scan.
> > > > > > 2. The archival (copying of data to tiered storage) process will have
> > > > > > a delayed start. The delayed start of archival could lead to a local
> > > > > > build-up of data, which may fill the disk.
> > > > >
> > > > > > The disadvantage of adding this new API is that every provider will
> > > > > > have to implement it, agreed. But I believe that this tradeoff is
> > > > > > worthwhile since the default implementation could be the same as you
> > > > > > mentioned, i.e. keeping a cumulative in-memory count.
> > > > >
> > > > > --
> > > > > Divij Vaidya
> > > > >
> > > > >
> > > > >
> > > > > On Sun, Jun 4, 2023 at 5:48 PM Kamal Chandraprakash <
> > > > > kamal.chandraprakash@gmail.com> wrote:
> > > > >
> > > > > > Hi Divij,
> > > > > >
> > > > > > Thanks for the KIP! Sorry for the late reply.
> > > > > >
> > > > > > > Can you explain the rejected alternative-3?
> > > > > > > Store the cumulative size of remote tier log in-memory at
> > > > > > > RemoteLogManager
> > > > > > > "*Cons*: Every time a broker starts-up, it will scan through all the
> > > > > > > segments in the remote tier to initialise the in-memory value. This
> > > > > > > would increase the broker start-up time."
> > > > > > >
> > > > > > > Keeping the source of truth to determine the remote-log-size in the
> > > > > > > leader would be consistent across different implementations of the
> > > > > > > plugin. The concern posted in the KIP is that we are calculating the
> > > > > > > remote-log-size on each iteration of the cleaner thread (say 5
> > > > > > > mins). If we calculate only once during broker startup or during the
> > > > > > > leadership reassignment, do we still need the cache?
> > > > > > >
> > > > > > > The broker startup time won't be affected by the remote log manager
> > > > > > > initialisation. The broker continues to accept new produce/fetch
> > > > > > > requests, while the RLM thread in the background can determine the
> > > > > > > remote-log-size once and start copying/deleting the segments.
> > > > > >
> > > > > > Thanks,
> > > > > > Kamal
> > > > > >
> > > > > > > On Thu, Jun 1, 2023 at 2:08 PM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > wrote:
> > > > > >
> > > > > > > Satish / Jun
> > > > > > >
> > > > > > > Do you have any thoughts on this?
> > > > > > >
> > > > > > > --
> > > > > > > Divij Vaidya
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > On Tue, Feb 14, 2023 at 4:15 PM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > > wrote:
> > > > > > >
> > > > > > > > Hey Jun
> > > > > > > >
> > > > > > > > > It has been a while since this KIP got some attention. While we
> > > > > > > > > wait for Satish to chime in here, perhaps I can answer your
> > > > > > > > > question.
> > > > > > > > >
> > > > > > > > > > Could you explain how you exposed the log size in your KIP-405
> > > > > > > > > > implementation?
> > > > > > > > >
> > > > > > > > > The APIs available in RLMM as per KIP-405 are
> > > > > > > > > addRemoteLogSegmentMetadata(), updateRemoteLogSegmentMetadata(),
> > > > > > > > > remoteLogSegmentMetadata(), highestOffsetForEpoch(),
> > > > > > > > > putRemotePartitionDeleteMetadata(), listRemoteLogSegments(),
> > > > > > > > > onPartitionLeadershipChanges() and onStopPartitions(). None of
> > > > > > > > > these APIs allows us to expose the log size; hence, the only
> > > > > > > > > option that remains is to list all segments using
> > > > > > > > > listRemoteLogSegments() and aggregate them every time we need to
> > > > > > > > > calculate the size. Based on our prior discussion, this requires
> > > > > > > > > reading all segment metadata, which won't work for non-local RLMM
> > > > > > > > > implementations. Satish's implementation also performs a full scan
> > > > > > > > > and calculates the aggregate. See:
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> >
> https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/remote/RemoteLogManager.scala#L619
> > > > > > > >
> > > > > > > >
> > > > > > > > Does this answer your question?
> > > > > > > >
> > > > > > > > --
> > > > > > > > Divij Vaidya
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Tue, Dec 20, 2022 at 8:40 PM Jun Rao <jun@confluent.io.invalid>
> > > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> Hi, Divij,
> > > > > > > >>
> > > > > > > >> Thanks for the explanation.
> > > > > > > >>
> > > > > > > >> Good question.
> > > > > > > >>
> > > > > > > >> Hi, Satish,
> > > > > > > >>
> > > > > > > > >> Could you explain how you exposed the log size in your KIP-405
> > > > > > > > >> implementation?
> > > > > > > >>
> > > > > > > >> Thanks,
> > > > > > > >>
> > > > > > > >> Jun
> > > > > > > >>
> > > > > > > > >> On Tue, Dec 20, 2022 at 4:59 AM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > > >> wrote:
> > > > > > > >>
> > > > > > > >> > Hey Jun
> > > > > > > >> >
> > > > > > > > >> > Yes, it is possible to maintain the log size in the cache (see
> > > > > > > > >> > rejected alternative #3 in the KIP), but I did not understand how
> > > > > > > > >> > it is possible to retrieve it without the new API. The log size
> > > > > > > > >> > could be calculated on startup by scanning through the segments
> > > > > > > > >> > and incrementally maintained afterwards (though I would disagree
> > > > > > > > >> > that this is the right approach, since the scan itself takes on
> > > > > > > > >> > the order of minutes and hence delays the start of the archive
> > > > > > > > >> > process). Even then, we would need an API in
> > > > > > > > >> > RemoteLogMetadataManager so that RLM could fetch the cached size!
> > > > > > > > >> >
> > > > > > > > >> > If we wish to cache the size without adding a new API, then we
> > > > > > > > >> > need to cache the size in RLM itself (instead of the RLMM
> > > > > > > > >> > implementation) and incrementally manage it. The downside of a
> > > > > > > > >> > longer archive time at startup (due to the initial scan) still
> > > > > > > > >> > remains valid in this situation.
> > > > > > > >> >
> > > > > > > >> > --
> > > > > > > >> > Divij Vaidya
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > > >> > On Fri, Dec 16, 2022 at 12:43 AM Jun Rao <jun@confluent.io.invalid>
> > > > > > > > >> > wrote:
> > > > > > > >> >
> > > > > > > >> > > Hi, Divij,
> > > > > > > >> > >
> > > > > > > >> > > Thanks for the explanation.
> > > > > > > >> > >
> > > > > > > > >> > > If there is an in-memory cache, could we maintain the log size in
> > > > > > > > >> > > the cache with the existing API? For example, a replica could make
> > > > > > > > >> > > a listRemoteLogSegments(TopicIdPartition topicIdPartition) call on
> > > > > > > > >> > > startup to get the remote segment size before the current
> > > > > > > > >> > > leaderEpoch. The leader could then maintain the size incrementally
> > > > > > > > >> > > afterwards. On leader change, other replicas can make a
> > > > > > > > >> > > listRemoteLogSegments(TopicIdPartition topicIdPartition, int
> > > > > > > > >> > > leaderEpoch) call to get the size of newly generated segments.
> > > > > > > >> > >
> > > > > > > >> > > Thanks,
> > > > > > > >> > >
> > > > > > > >> > > Jun
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > > >> > > On Wed, Dec 14, 2022 at 3:27 AM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > > >> > > wrote:
> > > > > > > >> > >
> > > > > > > > >> > > > > Is the new method enough for doing size-based retention?
> > > > > > > > >> > > >
> > > > > > > > >> > > > Yes. You are right in assuming that this API only provides the
> > > > > > > > >> > > > remote storage size (for the current epoch chain). We would use
> > > > > > > > >> > > > this API for size-based retention along with a value of
> > > > > > > > >> > > > localOnlyLogSegmentSize, which is computed as
> > > > > > > > >> > > > Log.sizeInBytes(logSegments.filter(_.baseOffset >
> > > > > > > > >> > > > highestOffsetWithRemoteIndex)). Hence, (total_log_size =
> > > > > > > > >> > > > remoteLogSizeBytes + log.localOnlyLogSegmentSize). I have updated
> > > > > > > > >> > > > the KIP with this information. You can also check an example
> > > > > > > > >> > > > implementation at
> > > > > > > >> > > >
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > >
> >
> https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/Log.scala#L2077
> > > > > > > >> > > >
> > > > > > > >> > > >
> > > > > > > > >> > > > > Do you imagine all accesses to remote metadata will be across the
> > > > > > > > >> > > > > network or will there be some local in-memory cache?
> > > > > > > > >> > > >
> > > > > > > > >> > > > I would expect a disk-less implementation to maintain a finite
> > > > > > > > >> > > > in-memory cache for segment metadata to optimize the number of
> > > > > > > > >> > > > network calls made to fetch the data. In future, we can think
> > > > > > > > >> > > > about bringing this finite-size cache into RLM itself, but that's
> > > > > > > > >> > > > probably a conversation for a different KIP. There are many other
> > > > > > > > >> > > > things we would like to do to optimize the Tiered storage
> > > > > > > > >> > > > interface, such as introducing a circular buffer / streaming
> > > > > > > > >> > > > interface from RSM (so that we don't have to wait to fetch the
> > > > > > > > >> > > > entire segment before starting to send records to the consumer),
> > > > > > > > >> > > > caching the segments fetched from RSM locally (I would expect all
> > > > > > > > >> > > > RSM plugin implementations to do this, so we might as well add it
> > > > > > > > >> > > > to RLM), etc.
> > > > > > > >> > > >
> > > > > > > >> > > > --
> > > > > > > >> > > > Divij Vaidya
> > > > > > > >> > > >
> > > > > > > >> > > >
> > > > > > > >> > > >
> > > > > > > > >> > > > On Mon, Dec 12, 2022 at 7:35 PM Jun Rao <jun@confluent.io.invalid>
> > > > > > > > >> > > > wrote:
> > > > > > > >> > > >
> > > > > > > >> > > > > Hi, Divij,
> > > > > > > >> > > > >
> > > > > > > >> > > > > Thanks for the reply.
> > > > > > > >> > > > >
> > > > > > > > >> > > > > Is the new method enough for doing size-based retention? It gives
> > > > > > > > >> > > > > the total size of the remote segments, but it seems that we still
> > > > > > > > >> > > > > don't know the exact total size for a log since there could be
> > > > > > > > >> > > > > overlapping segments between the remote and the local segments.
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > You mentioned a disk-less implementation. Do you imagine all
> > > > > > > > >> > > > > accesses to remote metadata will be across the network or will
> > > > > > > > >> > > > > there be some local in-memory cache?
> > > > > > > >> > > > >
> > > > > > > >> > > > > Thanks,
> > > > > > > >> > > > >
> > > > > > > >> > > > > Jun
> > > > > > > >> > > > >
> > > > > > > >> > > > >
> > > > > > > >> > > > >
> > > > > > > > >> > > > > On Wed, Dec 7, 2022 at 3:10 AM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > > >> > > > > wrote:
> > > > > > > >> > > > >
> > > > > > > > >> > > > > > The method is needed for RLMM implementations which fetch the
> > > > > > > > >> > > > > > information over the network and not for the disk-based
> > > > > > > > >> > > > > > implementations (such as the default topic-based RLMM).
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > I would argue that adding this API makes the interface more
> > > > > > > > >> > > > > > generic than what it is today. This is because, with the current
> > > > > > > > >> > > > > > APIs, an implementor is restricted to disk-based RLMM solutions
> > > > > > > > >> > > > > > only (i.e. the default solution), whereas if we add this new API,
> > > > > > > > >> > > > > > we unblock usage of network-based RLMM implementations such as
> > > > > > > > >> > > > > > databases.
> > > > > > > >> > > > > >
> > > > > > > >> > > > > >
> > > > > > > >> > > > > >
> > > > > > > > >> > > > > > On Wed 30. Nov 2022 at 20:40, Jun Rao <jun@confluent.io.invalid>
> > > > > > > > >> > > > > > wrote:
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > > Hi, Divij,
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > Thanks for the reply.
> > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > Point#2. My high-level question is whether the new method is
> > > > > > > > >> > > > > > > needed for every implementation of remote storage or just for a
> > > > > > > > >> > > > > > > specific implementation. The issues that you pointed out exist
> > > > > > > > >> > > > > > > for the default implementation of RLMM as well and so far, the
> > > > > > > > >> > > > > > > default implementation hasn't found a need for a similar new
> > > > > > > > >> > > > > > > method. For a public interface, ideally we want to make it more
> > > > > > > > >> > > > > > > general.
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > Thanks,
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > Jun
> > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > On Mon, Nov 21, 2022 at 7:11 AM Divij Vaidya <divijvaidya13@gmail.com>
> > > > > > > > >> > > > > > > wrote:
> > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > > Thank you Jun and Alex for your comments.
> > > > > > > > >> > > > > > > >
> > > > > > > > >> > > > > > > > Point#1: You are right Jun. As Alex mentioned, the "derived
> > > > > > > > >> > > > > > > > metadata" can increase the size of cached metadata by a factor
> > > > > > > > >> > > > > > > > of 10 but it should be ok to cache just the actual metadata. My
> > > > > > > > >> > > > > > > > point about size being a limitation for using cache is not
> > > > > > > > >> > > > > > > > valid anymore.
> > > > > > > > >> > > > > > > >
> > > > > > > > >> > > > > > > > Point#2: A new replica would still have to fetch the metadata
> > > > > > > > >> > > > > > > > over the network to initiate the warm up of the cache, which
> > > > > > > > >> > > > > > > > increases the start time of the archival process. Please also
> > > > > > > > >> > > > > > > > note the repercussions of the warm up scan that Alex mentioned
> > > > > > > > >> > > > > > > > in this thread as part of #102.2.
> > > > > > > > >> > > > > > > >
> > > > > > > > >> > > > > > > > 100#: Agreed Alex. Thanks for clarifying that. My point about
> > > > > > > > >> > > > > > > > size being a limitation for using cache is not valid anymore.
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > 101#: Alex, if I understand correctly, you are
> > > > > > suggesting
> > > > > > > to
> > > > > > > >> > > cache
> > > > > > > >> > > > > the
> > > > > > > >> > > > > > > > total size at the leader and update it on
> > > archival.
> > > > > This
> > > > > > > >> > wouldn't
> > > > > > > >> > > > > work
> > > > > > > >> > > > > > > for
> > > > > > > >> > > > > > > > cases when the leader restarts where we would
> > > have to
> > > > > > > make a
> > > > > > > >> > full
> > > > > > > >> > > > > scan
> > > > > > > >> > > > > > > > to update the total size entry on startup. We
> > > expect
> > > > > > users
> > > > > > > >> to
> > > > > > > >> > > store
> > > > > > > >> > > > > > data
> > > > > > > >> > > > > > > > over longer duration in remote storage which
> > > increases
> > > > > > the
> > > > > > > >> > > > likelihood
> > > > > > > >> > > > > > of
> > > > > > > >> > > > > > > > leader restarts / failovers.
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > 102#.1: I don't think that the current design
> > > > > > accommodates
> > > > > > > >> the
> > > > > > > >> > > fact
> > > > > > > >> > > > > > that
> > > > > > > >> > > > > > > > data corruption could happen at the RLMM
> plugin
> > > (we
> > > > > > don't
> > > > > > > >> have
> > > > > > > >> > > > > checksum
> > > > > > > >> > > > > > > as
> > > > > > > >> > > > > > > > a field in metadata as part of KIP405). If
> data
> > > > > > corruption
> > > > > > > >> > > occurs,
> > > > > > > >> > > > w/
> > > > > > > >> > > > > > or
> > > > > > > >> > > > > > > > w/o the cache, it would be a different problem
> > to
> > > > > > solve. I
> > > > > > > >> > would
> > > > > > > >> > > > like
> > > > > > > >> > > > > > to
> > > > > > > >> > > > > > > > keep this outside the scope of this KIP.
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > 102#.2: Agree. This remains as the main
> concern
> > > for
> > > > > > using
> > > > > > > >> the
> > > > > > > >> > > cache
> > > > > > > >> > > > > to
> > > > > > > >> > > > > > > > fetch total size.
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > Regards,
> > > > > > > >> > > > > > > > Divij Vaidya
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > On Fri, Nov 18, 2022 at 12:59 PM Alexandre
> > > Dupriez <
> > > > > > > >> > > > > > > > alexandre.dupriez@gmail.com> wrote:
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > > Hi Divij,
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > Thanks for the KIP. Please find some
> comments
> > > based
> > > > > on
> > > > > > > >> what I
> > > > > > > >> > > > read
> > > > > > > >> > > > > on
> > > > > > > >> > > > > > > > > this thread so far - apologies for the
> repeats
> > > and
> > > > > the
> > > > > > > >> late
> > > > > > > >> > > > reply.
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > If I understand correctly, one of the main
> > > elements
> > > > > of
> > > > > > > >> > > discussion
> > > > > > > >> > > > > is
> > > > > > > >> > > > > > > > > about caching in Kafka versus delegation of
> > > > > providing
> > > > > > > the
> > > > > > > >> > > remote
> > > > > > > >> > > > > size
> > > > > > > >> > > > > > > > > of a topic-partition to the plugin.
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > A few comments:
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > 100. The size of the “derived metadata”
> which
> > is
> > > > > > managed
> > > > > > > >> by
> > > > > > > >> > the
> > > > > > > >> > > > > > plugin
> > > > > > > >> > > > > > > > > to represent an rlmMetadata can indeed be
> > close
> > > to 1
> > > > > > kB
> > > > > > > on
> > > > > > > >> > > > average
> > > > > > > >> > > > > > > > > depending on its own internal structure,
> e.g.
> > > the
> > > > > > > >> redundancy
> > > > > > > >> > it
> > > > > > > >> > > > > > > > > enforces (unfortunately resulting in
> > > duplication),
> > > > > > > >> additional
> > > > > > > >> > > > > > > > > information such as checksums and primary
> and
> > > > > > secondary
> > > > > > > >> > > indexable
> > > > > > > >> > > > > > > > > keys. But indeed, the rlmMetadata is itself
> a
> > > > > lighter
> > > > > > > data
> > > > > > > >> > > > > structure
> > > > > > > >> > > > > > > > > by a factor of 10. And indeed, instead of
> > > caching
> > > > > the
> > > > > > > >> > “derived
> > > > > > > >> > > > > > > > > metadata”, only the rlmMetadata could be,
> > which
> > > > > should
> > > > > > > >> > address
> > > > > > > >> > > > the
> > > > > > > >> > > > > > > > > concern regarding the memory occupancy of
> the
> > > cache.
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > 101. I am not sure I fully understand why we
> > > would
> > > > > > need
> > > > > > > to
> > > > > > > >> > > cache
> > > > > > > >> > > > > the
> > > > > > > >> > > > > > > > > list of rlmMetadata to retain the remote
> size
> > > of a
> > > > > > > >> > > > topic-partition.
> > > > > > > >> > > > > > > > > Since the leader of a topic-partition is, in
> > > > > > > >> non-degenerated
> > > > > > > >> > > > cases,
> > > > > > > >> > > > > > > > > the only actor which can mutate the remote
> > part
> > > of
> > > > > the
> > > > > > > >> > > > > > > > > topic-partition, hence its size, it could in
> > > theory
> > > > > > only
> > > > > > > >> > cache
> > > > > > > >> > > > the
> > > > > > > >> > > > > > > > > size of the remote log once it has
> calculated
> > > it? In
> > > > > > > which
> > > > > > > >> > case
> > > > > > > >> > > > > there
> > > > > > > >> > > > > > > > > would not be any problem regarding the size
> of
> > > the
> > > > > > > caching
> > > > > > > >> > > > > strategy.
> > > > > > > >> > > > > > > > > Did I miss something there?
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > 102. There may be a few challenges to
> consider
> > > with
> > > > > > > >> caching:
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > 102.1) As mentioned above, the caching
> > strategy
> > > > > > assumes
> > > > > > > no
> > > > > > > >> > > > mutation
> > > > > > > >> > > > > > > > > outside the lifetime of a leader. While this
> > is
> > > true
> > > > > > in
> > > > > > > >> the
> > > > > > > >> > > > normal
> > > > > > > >> > > > > > > > > course of operation, there could be
> accidental
> > > > > > mutation
> > > > > > > >> > outside
> > > > > > > >> > > > of
> > > > > > > >> > > > > > the
> > > > > > > >> > > > > > > > > leader and a loss of consistency between the
> > > cached
> > > > > > > state
> > > > > > > >> and
> > > > > > > >> > > the
> > > > > > > >> > > > > > > > > actual remote representation of the log.
> E.g.
> > > > > > > split-brain
> > > > > > > >> > > > > scenarios,
> > > > > > > >> > > > > > > > > bugs in the plugins, bugs in external
> systems
> > > with
> > > > > > > >> mutating
> > > > > > > >> > > > access
> > > > > > > >> > > > > on
> > > > > > > >> > > > > > > > > the derived metadata. In the worst case, a
> > drift
> > > > > > between
> > > > > > > >> the
> > > > > > > >> > > > cached
> > > > > > > >> > > > > > > > > size and the actual size could lead to
> > > over-deleting
> > > > > > > >> remote
> > > > > > > >> > > data
> > > > > > > >> > > > > > which
> > > > > > > >> > > > > > > > > is a durability risk.
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > The alternative you propose, by making the
> > > plugin
> > > > > the
> > > > > > > >> source
> > > > > > > >> > of
> > > > > > > >> > > > > truth
> > > > > > > >> > > > > > > > > w.r.t. the size of the remote log, can
> make
> > > it
> > > > > > easier
> > > > > > > >> to
> > > > > > > >> > > avoid
> > > > > > > >> > > > > > > > > inconsistencies between plugin-managed
> > metadata
> > > and
> > > > > > the
> > > > > > > >> > remote
> > > > > > > >> > > > log
> > > > > > > >> > > > > > > > > from the perspective of Kafka. On the other
> > > hand,
> > > > > > plugin
> > > > > > > >> > > vendors
> > > > > > > >> > > > > > would
> > > > > > > >> > > > > > > > > have to implement it with the expected
> > > efficiency to
> > > > > > > have
> > > > > > > >> it
> > > > > > > >> > > > yield
> > > > > > > >> > > > > > > > > benefits.
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > 102.2) As you mentioned, the caching
> strategy
> > in
> > > > > Kafka
> > > > > > > >> would
> > > > > > > >> > > > still
> > > > > > > >> > > > > > > > > require one iteration over the list of
> > > rlmMetadata
> > > > > > when
> > > > > > > >> the
> > > > > > > >> > > > > > leadership
> > > > > > > >> > > > > > > > > of a topic-partition is assigned to a
> broker,
> > > while
> > > > > > the
> > > > > > > >> > plugin
> > > > > > > >> > > > can
> > > > > > > >> > > > > > > > > offer alternative constant-time approaches.
> > This
> > > > > > > >> calculation
> > > > > > > >> > > > cannot
> > > > > > > >> > > > > > be
> > > > > > > >> > > > > > > > > put on the LeaderAndIsr path and would be
> > > performed
> > > > > in
> > > > > > > the
> > > > > > > >> > > > > > background.
> > > > > > > >> > > > > > > > > In case of bulk leadership migration,
> listing
> > > the
> > > > > > > >> rlmMetadata
> > > > > > > >> > > > could
> > > > > > > >> > > > > > a)
> > > > > > > >> > > > > > > > > result in request bursts to any backend
> system
> > > the
> > > > > > > plugin
> > > > > > > >> may
> > > > > > > >> > > use
> > > > > > > >> > > > > > > > > [which shouldn’t be a problem for
> > > high-throughput
> > > > > data
> > > > > > > >> stores
> > > > > > > >> > > but
> > > > > > > >> > > > > > > > > could have cost implications] b) increase
> > > > > utilisation
> > > > > > > >> > timespan
> > > > > > > >> > > of
> > > > > > > >> > > > > the
> > > > > > > >> > > > > > > > > RLM threads for these calculations
> potentially
> > > > > leading
> > > > > > > to
> > > > > > > >> > > > transient
> > > > > > > >> > > > > > > > > starvation of tasks queued for, typically,
> > > > > offloading
> > > > > > > >> > > operations
> > > > > > > >> > > > c)
> > > > > > > >> > > > > > > > > could have a non-marginal CPU footprint on
> > > hardware
> > > > > > with
> > > > > > > >> > strict
> > > > > > > >> > > > > > > > > resource constraints. All these elements
> could
> > > have
> > > > > an
> > > > > > > >> impact
> > > > > > > >> > > to
> > > > > > > >> > > > > some
> > > > > > > >> > > > > > > > > degree depending on the operational
> > environment.
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > From a design perspective, one question is
> > > where we
> > > > > > want
> > > > > > > >> the
> > > > > > > >> > > > source
> > > > > > > >> > > > > > of
> > > > > > > >> > > > > > > > > truth w.r.t. remote log size to be during
> the
> > > > > lifetime
> > > > > > > of
> > > > > > > >> a
> > > > > > > >> > > > leader.
> > > > > > > >> > > > > > > > > The responsibility of maintaining a
> consistent
> > > > > > > >> representation
> > > > > > > >> > > of
> > > > > > > >> > > > > the
> > > > > > > >> > > > > > > > > remote log is shared by Kafka and the
> plugin.
> > > Which
> > > > > > > >> system is
> > > > > > > >> > > > best
> > > > > > > >> > > > > > > > > placed to maintain such a state while
> > providing
> > > the
> > > > > > > >> highest
> > > > > > > >> > > > > > > > > consistency guarantees is something both
> Kafka
> > > and
> > > > > > > plugin
> > > > > > > >> > > > designers
> > > > > > > >> > > > > > > > > could help understand better.
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > Many thanks,
> > > > > > > >> > > > > > > > > Alexandre
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > >
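As a rough illustration of the trade-off in points 101 and 102.2 above, the sketch below keeps a single running total on the leader instead of a list of rlmMetadata. It is not Kafka or plugin code: SegmentMetadata, RemoteSizeCache and the three callback names are hypothetical, and the list_segments argument merely stands in for a call such as listRemoteLogSegments(). Python is used only for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentMetadata:
    """Hypothetical stand-in for RemoteLogSegmentMetadata."""
    start_offset: int
    size_bytes: int


class RemoteSizeCache:
    """Hypothetical leader-side cache of one partition's total remote log size."""

    def __init__(self):
        self.total_bytes = None  # unknown until the first full scan

    def on_leadership_acquired(self, list_segments):
        # One linear pass over all segment metadata: the warm-up cost from 102.2.
        self.total_bytes = sum(s.size_bytes for s in list_segments())

    def on_segment_archived(self, segment):
        self._require_warm()
        self.total_bytes += segment.size_bytes

    def on_segment_deleted(self, segment):
        self._require_warm()
        self.total_bytes -= segment.size_bytes

    def _require_warm(self):
        if self.total_bytes is None:
            raise RuntimeError("cache not warmed up; run the full scan first")


# Usage: warm up once, then apply constant-time updates.
remote = [SegmentMetadata(0, 100), SegmentMetadata(50, 120)]
cache = RemoteSizeCache()
cache.on_leadership_acquired(lambda: iter(remote))

new_segment = SegmentMetadata(110, 80)
remote.append(new_segment)
cache.on_segment_archived(new_segment)
cache.on_segment_deleted(remote[0])
print(cache.total_bytes)
```

The running total needs constant memory, but the scan in on_leadership_acquired is still linear in the number of segments, and the total drifts silently if anything mutates the remote log outside these callbacks (point 102.1).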
> > > > > > > >> > > > > > > > > On Thu, Nov 17, 2022 at 7:27 PM Jun Rao <jun@confluent.io.invalid> wrote:
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > Hi, Divij,
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > Thanks for the reply.
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > Point #1. Is the average remote segment
> > > metadata
> > > > > > > really
> > > > > > > >> > 1KB?
> > > > > > > >> > > > > What's
> > > > > > > >> > > > > > > > > listed
> > > > > > > >> > > > > > > > > > in the public interface is probably well
> > > below 100
> > > > > > > >> bytes.
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > Point #2. I guess you are assuming that
> each
> > > > > broker
> > > > > > > only
> > > > > > > >> > > caches
> > > > > > > >> > > > > the
> > > > > > > >> > > > > > > > > remote
> > > > > > > >> > > > > > > > > > segment metadata in memory. An alternative
> > > > > approach
> > > > > > is
> > > > > > > >> to
> > > > > > > >> > > cache
> > > > > > > >> > > > > > them
> > > > > > > >> > > > > > > in
> > > > > > > >> > > > > > > > > > both memory and local disk. That way, on
> > > broker
> > > > > > > restart,
> > > > > > > >> > you
> > > > > > > >> > > > just
> > > > > > > >> > > > > > > need
> > > > > > > >> > > > > > > > to
> > > > > > > >> > > > > > > > > > fetch the new remote segments' metadata
> > using
> > > the
> > > > > > > >> > > > > > > > > > listRemoteLogSegments(TopicIdPartition
> > > > > > > topicIdPartition,
> > > > > > > >> > int
> > > > > > > >> > > > > > > > leaderEpoch)
> > > > > > > >> > > > > > > > > > api. Will that work?
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > Point #3. Thanks for the explanation and
> it
> > > sounds
> > > > > > > good.
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > Thanks,
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > Jun
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > On Thu, Nov 17, 2022 at 7:31 AM Divij
> > Vaidya <
> > > > > > > >> > > > > > > divijvaidya13@gmail.com>
> > > > > > > >> > > > > > > > > > wrote:
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > > Hi Jun
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > There are three points that I would like
> > to
> > > > > > present
> > > > > > > >> here:
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > 1. We would require a large cache size
> to
> > > > > > > efficiently
> > > > > > > >> > cache
> > > > > > > >> > > > all
> > > > > > > >> > > > > > > > segment
> > > > > > > >> > > > > > > > > > > metadata.
> > > > > > > >> > > > > > > > > > > 2. Linear scan of all metadata at broker
> > > startup
> > > > > > to
> > > > > > > >> > > populate
> > > > > > > >> > > > > the
> > > > > > > >> > > > > > > > cache
> > > > > > > >> > > > > > > > > will
> > > > > > > >> > > > > > > > > > > be slow and will impact the archival
> > > process.
> > > > > > > >> > > > > > > > > > > 3. There is no other use case where a
> full
> > > scan
> > > > > of
> > > > > > > >> > segment
> > > > > > > >> > > > > > metadata
> > > > > > > >> > > > > > > > is
> > > > > > > >> > > > > > > > > > > required.
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > Let's start by quantifying 1. Here's my
> > > estimate
> > > > > > for
> > > > > > > >> the
> > > > > > > >> > > size
> > > > > > > >> > > > > of
> > > > > > > >> > > > > > > the
> > > > > > > >> > > > > > > > > cache.
> > > > > > > >> > > > > > > > > > > Average size of segment metadata = 1KB.
> > This
> > > > > could
> > > > > > > be
> > > > > > > >> > more
> > > > > > > >> > > if
> > > > > > > >> > > > > we
> > > > > > > >> > > > > > > have
> > > > > > > >> > > > > > > > > > > frequent leader failover with a large
> > > number of
> > > > > > > leader
> > > > > > > >> > > epochs
> > > > > > > >> > > > > > being
> > > > > > > >> > > > > > > > > stored
> > > > > > > >> > > > > > > > > > > per segment.
> > > > > > > >> > > > > > > > > > > Segment size = 100MB. Users will prefer
> to
> > > > > reduce
> > > > > > > the
> > > > > > > >> > > segment
> > > > > > > >> > > > > > size
> > > > > > > >> > > > > > > > > from the
> > > > > > > >> > > > > > > > > > > default value of 1GB to ensure timely
> > > archival
> > > > > of
> > > > > > > data
> > > > > > > >> > > since
> > > > > > > >> > > > > data
> > > > > > > >> > > > > > > > from
> > > > > > > >> > > > > > > > > > > active segment is not archived.
> > > > > > > >> > > > > > > > > > > Cache size = num segments * avg. segment
> > > > > metadata
> > > > > > > >> size =
> > > > > > > >> > > > > > > > > (100TB/100MB)*1KB
> > > > > > > >> > > > > > > > > > > = 1GB.
> > > > > > > >> > > > > > > > > > > While 1GB for cache may not sound like a
> > > large
> > > > > > > number
> > > > > > > >> for
> > > > > > > >> > > > > larger
> > > > > > > >> > > > > > > > > machines,
> > > > > > > >> > > > > > > > > > > it does eat into the memory as an
> > additional
> > > > > cache
> > > > > > > and
> > > > > > > >> > > makes
> > > > > > > >> > > > > use
> > > > > > > >> > > > > > > > cases
> > > > > > > >> > > > > > > > > with
> > > > > > > >> > > > > > > > > > > large data retention with low throughput
> > > > > expensive
> > > > > > > >> (where
> > > > > > > >> > > > such
> > > > > > > >> > > > > > use
> > > > > > > >> > > > > > > > case
> > > > > > > >> > > > > > > > > > > could use smaller machines).
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > About point#2:
> > > > > > > >> > > > > > > > > > > Even if we say that all segment metadata
> > > can fit
> > > > > > > into
> > > > > > > >> the
> > > > > > > >> > > > > cache,
> > > > > > > >> > > > > > we
> > > > > > > >> > > > > > > > > will
> > > > > > > >> > > > > > > > > > > need to populate the cache on broker
> > > startup. It
> > > > > > > would
> > > > > > > >> > not
> > > > > > > >> > > be
> > > > > > > >> > > > > in
> > > > > > > >> > > > > > > the
> > > > > > > >> > > > > > > > > > > critical path of broker startup and
> hence
> > > won't
> > > > > > > >> impact
> > > > > > > >> > the
> > > > > > > >> > > > > > startup
> > > > > > > >> > > > > > > > > time.
> > > > > > > >> > > > > > > > > > > But it will impact the time when we
> could
> > > start
> > > > > > the
> > > > > > > >> > > archival
> > > > > > > >> > > > > > > process
> > > > > > > >> > > > > > > > > since
> > > > > > > >> > > > > > > > > > > the RLM thread pool will be blocked on
> the
> > > first
> > > > > > > call
> > > > > > > >> to
> > > > > > > >> > > > > > > > > > > listRemoteLogSegments(). To scan
> metadata
> > > for
> > > > > 1MM
> > > > > > > >> > segments
> > > > > > > >> > > > > > > (computed
> > > > > > > >> > > > > > > > > above)
> > > > > > > >> > > > > > > > > > > and transfer 1GB data over the network
> > from
> > > a
> > > > > RLMM
> > > > > > > >> such
> > > > > > > >> > as
> > > > > > > >> > > a
> > > > > > > >> > > > > > remote
> > > > > > > >> > > > > > > > > > > database would be in the order of
> minutes
> > > > > > (depending
> > > > > > > >> on
> > > > > > > >> > how
> > > > > > > >> > > > > > > efficient
> > > > > > > >> > > > > > > > > the
> > > > > > > >> > > > > > > > > > > scan is with the RLMM implementation).
> > > > > Although, I
> > > > > > > >> would
> > > > > > > >> > > > > concede
> > > > > > > >> > > > > > > that
> > > > > > > >> > > > > > > > > > > having RLM threads blocked for a few
> > > minutes is
> > > > > > > >> perhaps
> > > > > > > >> > OK
> > > > > > > >> > > > but
> > > > > > > >> > > > > if
> > > > > > > >> > > > > > > we
> > > > > > > >> > > > > > > > > > > introduce the new API proposed in the
> KIP,
> > > we
> > > > > > would
> > > > > > > >> have
> > > > > > > >> > a
> > > > > > > >> > > > > > > > > > > deterministic startup time for RLM.
> Adding
> > > the
> > > > > API
> > > > > > > >> comes
> > > > > > > >> > > at a
> > > > > > > >> > > > > low
> > > > > > > >> > > > > > > > cost
> > > > > > > >> > > > > > > > > and
> > > > > > > >> > > > > > > > > > > I believe the trade off is worth it.
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > About point#3:
> > > > > > > >> > > > > > > > > > > We can use
> > > > > listRemoteLogSegments(TopicIdPartition
> > > > > > > >> > > > > > topicIdPartition,
> > > > > > > >> > > > > > > > int
> > > > > > > >> > > > > > > > > > > leaderEpoch) to calculate the segments
> > > eligible
> > > > > > for
> > > > > > > >> > > deletion
> > > > > > > >> > > > > > (based
> > > > > > > >> > > > > > > > on
> > > > > > > >> > > > > > > > > size
> > > > > > > >> > > > > > > > > > > retention) where leader epoch(s) belong
> to
> > > the
> > > > > > > current
> > > > > > > >> > > leader
> > > > > > > >> > > > > > epoch
> > > > > > > >> > > > > > > > > chain.
> > > > > > > >> > > > > > > > > > > I understand that it may lead to
> segments
> > > > > > belonging
> > > > > > > to
> > > > > > > >> > > other
> > > > > > > >> > > > > > epoch
> > > > > > > >> > > > > > > > > lineage
> > > > > > > >> > > > > > > > > > > not getting deleted and would require a
> > > separate
> > > > > > > >> > mechanism
> > > > > > > >> > > to
> > > > > > > >> > > > > > > delete
> > > > > > > >> > > > > > > > > them.
> > > > > > > >> > > > > > > > > > > The separate mechanism would anyways be
> > > required
> > > > > > to
> > > > > > > >> > delete
> > > > > > > >> > > > > these
> > > > > > > >> > > > > > > > > "leaked"
> > > > > > > >> > > > > > > > > > > segments as there are other cases which
> > > could
> > > > > lead
> > > > > > > to
> > > > > > > >> > leaks
> > > > > > > >> > > > > such
> > > > > > > >> > > > > > as
> > > > > > > >> > > > > > > > > network
> > > > > > > >> > > > > > > > > > > problems with RSM midway through writing a segment, etc.
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > Thank you for the replies so far. They
> > have
> > > made
> > > > > > me
> > > > > > > >> > > re-think
> > > > > > > >> > > > my
> > > > > > > >> > > > > > > > > assumptions
> > > > > > > >> > > > > > > > > > > and this dialogue has been very
> > > constructive for
> > > > > > me.
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > Regards,
> > > > > > > >> > > > > > > > > > > Divij Vaidya
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > > >
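The two cache-size estimates traded in this thread differ only in their inputs, so it may help to see the arithmetic spelled out. The sketch below is a back-of-the-envelope calculation, not a measurement; the function name is hypothetical and decimal units (1 KB = 1000 bytes) are an assumption, since the thread does not say which convention is meant.

```python
# Decimal units are an assumption; the thread does not specify binary vs decimal.
KB, MB, GB, TB = 10**3, 10**6, 10**9, 10**12


def metadata_cache_bytes(remote_bytes, segment_bytes, metadata_bytes_per_segment):
    """Return (segment count, bytes needed to cache one metadata entry per segment)."""
    num_segments = remote_bytes // segment_bytes
    return num_segments, num_segments * metadata_bytes_per_segment


# 100 TB per broker, 100 MB segments, 1 KB of metadata per segment.
segments_a, cache_a = metadata_cache_bytes(100 * TB, 100 * MB, 1 * KB)

# 100 TB per broker, 1 GB segments, 100 bytes of metadata per segment.
segments_b, cache_b = metadata_cache_bytes(100 * TB, 1 * GB, 100)

print(segments_a, cache_a // GB, "GB")
print(segments_b, cache_b // MB, "MB")
```

Under these inputs the first case gives one million segments and 1 GB of cached metadata, the second gives one hundred thousand segments and 10 MB. The hundredfold gap comes equally from the segment size and the per-segment metadata size, each contributing a factor of ten.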
> > > > > > > >> > > > > > > > > > > On Thu, Nov 10, 2022 at 10:49 PM Jun Rao
> > > > > > > >> > > > > > <jun@confluent.io.invalid
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > > wrote:
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > Hi, Divij,
> > > > > > > >> > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > Thanks for the reply.
> > > > > > > >> > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > It's true that the data in Kafka could
> > be
> > > kept
> > > > > > > >> longer
> > > > > > > >> > > with
> > > > > > > >> > > > > > > KIP-405.
> > > > > > > >> > > > > > > > > How
> > > > > > > >> > > > > > > > > > > > much data do you envision to have per
> > > broker?
> > > > > > For
> > > > > > > >> 100TB
> > > > > > > >> > > > data
> > > > > > > >> > > > > > per
> > > > > > > >> > > > > > > > > broker,
> > > > > > > >> > > > > > > > > > > > with 1GB segment and segment metadata
> of
> > > 100
> > > > > > > bytes,
> > > > > > > >> it
> > > > > > > >> > > > > requires
> > > > > > > >> > > > > > > > > > > > 100TB/1GB*100 = 10MB, which should fit
> > in
> > > > > > memory.
> > > > > > > >> > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > RemoteLogMetadataManager has two
> > > > > > > >> > listRemoteLogSegments()
> > > > > > > >> > > > > > methods.
> > > > > > > >> > > > > > > > > The one
> > > > > > > >> > > > > > > > > > > > you listed
> > > > > > listRemoteLogSegments(TopicIdPartition
> > > > > > > >> > > > > > > topicIdPartition,
> > > > > > > >> > > > > > > > > int
> > > > > > > >> > > > > > > > > > > > leaderEpoch) does return data in
> offset
> > > order.
> > > > > > > >> However,
> > > > > > > >> > > the
> > > > > > > >> > > > > > other
> > > > > > > >> > > > > > > > > > > > one
> > listRemoteLogSegments(TopicIdPartition
> > > > > > > >> > > > topicIdPartition)
> > > > > > > >> > > > > > > > doesn't
> > > > > > > >> > > > > > > > > > > > specify the return order. I assume
> that
> > > you
> > > > > need
> > > > > > > the
> > > > > > > >> > > latter
> > > > > > > >> > > > > to
> > > > > > > >> > > > > > > > > calculate
> > > > > > > >> > > > > > > > > > > > the segment size?
> > > > > > > >> > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > Thanks,
> > > > > > > >> > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > Jun
> > > > > > > >> > > > > > > > > > > >
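The return order of listRemoteLogSegments() matters for the size-based retention calculation discussed above, because deletion proceeds from the oldest segment. The sketch below shows one way that calculation could look when only a listing call is available. It is an illustration under assumptions, not the KIP-405 implementation: segments_to_delete is a hypothetical name, segments are plain (start_offset, size_bytes) tuples, and the oldest-first rule that never drops below the retention size is assumed rather than taken from the thread.

```python
def segments_to_delete(segments, retention_bytes):
    """Pick the oldest segments whose removal keeps the log at or above retention_bytes.

    segments: list of (start_offset, size_bytes), sorted by start_offset ascending.
    Returns the start offsets of the segments eligible for deletion.
    """
    # Full pass over every segment just to learn the total remote size.
    total = sum(size for _, size in segments)
    excess = total - retention_bytes

    eligible = []
    for start_offset, size in segments:
        # Stop as soon as deleting this segment would drop below the retention size.
        if excess - size < 0:
            break
        eligible.append(start_offset)
        excess -= size
    return eligible


log = [(0, 40), (100, 40), (200, 40)]
print(segments_to_delete(log, 70))
print(segments_to_delete(log, 200))
```

The first pass over all segments exists only to compute the total; if the total were available directly, the loop could stop after visiting the few oldest segments.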
> > > > > > > >> > > > > > > > > > > > On Thu, Nov 10, 2022 at 10:25 AM Divij
> > > Vaidya
> > > > > <
> > > > > > > >> > > > > > > > > divijvaidya13@gmail.com>
> > > > > > > >> > > > > > > > > > > > wrote:
> > > > > > > >> > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > *Jun,*
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > *"the default implementation of RLMM
> > > does
> > > > > > local
> > > > > > > >> > > caching,
> > > > > > > >> > > > > > > right?"*
> > > > > > > >> > > > > > > > > > > > > Yes, Jun. The default implementation
> > of
> > > RLMM
> > > > > > > does
> > > > > > > >> > > indeed
> > > > > > > >> > > > > > cache
> > > > > > > >> > > > > > > > the
> > > > > > > >> > > > > > > > > > > > segment
> > > > > > > >> > > > > > > > > > > > > metadata today, hence, it won't work
> > > for use
> > > > > > > cases
> > > > > > > >> > when
> > > > > > > >> > > > the
> > > > > > > >> > > > > > > > number
> > > > > > > >> > > > > > > > > of
> > > > > > > >> > > > > > > > > > > > > segments in remote storage is large
> > > enough
> > > > > to
> > > > > > > >> exceed
> > > > > > > >> > > the
> > > > > > > >> > > > > size
> > > > > > > >> > > > > > > of
> > > > > > > >> > > > > > > > > cache.
> > > > > > > >> > > > > > > > > > > > As
> > > > > > > >> > > > > > > > > > > > > part of this KIP, I will implement
> the
> > > new
> > > > > > > >> proposed
> > > > > > > >> > API
> > > > > > > >> > > > in
> > > > > > > >> > > > > > the
> > > > > > > >> > > > > > > > > default
> > > > > > > >> > > > > > > > > > > > > implementation of RLMM but the
> > > underlying
> > > > > > > >> > > implementation
> > > > > > > >> > > > > will
> > > > > > > >> > > > > > > > > still be
> > > > > > > >> > > > > > > > > > > a
> > > > > > > >> > > > > > > > > > > > > scan. I will pick up optimizing that
> > in
> > > a
> > > > > > > separate
> > > > > > > >> > PR.
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > *"we also cache all segment metadata
> > in
> > > the
> > > > > > > >> brokers
> > > > > > > >> > > > without
> > > > > > > >> > > > > > > > > KIP-405. Do
> > > > > > > >> > > > > > > > > > > > you
> > > > > > > >> > > > > > > > > > > > > see a need to change that?"*
> > > > > > > >> > > > > > > > > > > > > Please correct me if I am wrong here
> > > but we
> > > > > > > cache
> > > > > > > >> > > > metadata
> > > > > > > >> > > > > > for
> > > > > > > >> > > > > > > > > segments
> > > > > > > >> > > > > > > > > > > > > "residing in local storage". The
> size
> > > of the
> > > > > > > >> current
> > > > > > > >> > > > cache
> > > > > > > >> > > > > > > works
> > > > > > > >> > > > > > > > > fine
> > > > > > > >> > > > > > > > > > > for
> > > > > > > >> > > > > > > > > > > > > the scale of the number of segments
> > > that we
> > > > > > > >> expect to
> > > > > > > >> > > > store
> > > > > > > >> > > > > > in
> > > > > > > >> > > > > > > > > local
> > > > > > > >> > > > > > > > > > > > > storage. After KIP-405, that cache
> > will
> > > > > > continue
> > > > > > > >> to
> > > > > > > >> > > store
> > > > > > > >> > > > > > > > metadata
> > > > > > > >> > > > > > > > > for
> > > > > > > >> > > > > > > > > > > > > segments which are residing in local
> > > storage
> > > > > > and
> > > > > > > >> > hence,
> > > > > > > >> > > > we
> > > > > > > >> > > > > > > don't
> > > > > > > >> > > > > > > > > need
> > > > > > > >> > > > > > > > > > > to
> > > > > > > >> > > > > > > > > > > > > change that. For segments which have
> > > been
> > > > > > > >> offloaded
> > > > > > > >> > to
> > > > > > > >> > > > > remote
> > > > > > > >> > > > > > > > > storage,
> > > > > > > >> > > > > > > > > > > it
> > > > > > > >> > > > > > > > > > > > > would rely on RLMM. Note that the
> > scale
> > > of
> > > > > > data
> > > > > > > >> > stored
> > > > > > > >> > > in
> > > > > > > >> > > > > > RLMM
> > > > > > > >> > > > > > > is
> > > > > > > >> > > > > > > > > > > > different
> > > > > > > >> > > > > > > > > > > > > from local cache because the number
> of
> > > > > > segments
> > > > > > > is
> > > > > > > >> > > > expected
> > > > > > > >> > > > > > to
> > > > > > > >> > > > > > > be
> > > > > > > >> > > > > > > > > much
> > > > > > > >> > > > > > > > > > > > > larger than what current
> > implementation
> > > > > stores
> > > > > > > in
> > > > > > > >> > local
> > > > > > > >> > > > > > > storage.
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > 2,3,4:
> > > > > > > >> > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > >> > > > > does
> > > > > > > >> > > > > > > > > specify
> > > > > > > >> > > > > > > > > > > the
> > > > > > > >> > > > > > > > > > > > > order i.e. it returns the segments
> > > sorted by
> > > > > > > first
> > > > > > > >> > > offset
> > > > > > > >> > > > > in
> > > > > > > >> > > > > > > > > ascending
> > > > > > > >> > > > > > > > > > > > > order. I am copying the API docs for
> > > KIP-405
> > > > > > > here
> > > > > > > >> for
> > > > > > > >> > > > your
> > > > > > > >> > > > > > > > > reference
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > *Returns iterator of remote log
> > segment
> > > > > > > metadata,
> > > > > > > >> > > sorted
> > > > > > > >> > > > by
> > > > > > > >> > > > > > > > {@link
> > > > > > > >> > > > > > > > > > > > >
> > RemoteLogSegmentMetadata#startOffset()}
> > > > > > > >> inascending
> > > > > > > >> > > order
> > > > > > > >> > > > > > which
> > > > > > > >> > > > > > > > > > > contains
> > > > > > > >> > > > > > > > > > > > > the given leader epoch. This is used
> > by
> > > > > remote
> > > > > > > log
> > > > > > > >> > > > > retention
> > > > > > > >> > > > > > > > > management
> > > > > > > >> > > > > > > > > > > > > subsystemto fetch the segment
> metadata
> > > for a
> > > > > > > given
> > > > > > > >> > > leader
> > > > > > > > >> > > > > > > > > epoch. @param
> > > > > > > >> > > > > > > > > > > > > topicIdPartition topic
> partition @param
> > > > > > > >> leaderEpoch
> > > > > > > >> > > > > > leader
> > > > > > > > >> > > > > > > > > > > > > epoch @return
> > > > > > > >> > > > > > > > > > > > > Iterator of remote segments, sorted
> by
> > > start
> > > > > > > >> offset
> > > > > > > >> > in
> > > > > > > >> > > > > > > ascending
> > > > > > > >> > > > > > > > > > > order. *
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > *Luke,*
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > 5. Note that we are trying to
> optimize
> > > the
> > > > > > > >> efficiency
> > > > > > > >> > > of
> > > > > > > >> > > > > size
> > > > > > > >> > > > > > > > based
> > > > > > > >> > > > > > > > > > > > > retention for remote storage.
> KIP-405
> > > does
> > > > > not
> > > > > > > >> > > introduce
> > > > > > > >> > > > a
> > > > > > > >> > > > > > new
> > > > > > > >> > > > > > > > > config
> > > > > > > >> > > > > > > > > > > for
> > > > > > > >> > > > > > > > > > > > > periodically checking remote similar
> > to
> > > > > > > >> > > > > > > > > > > log.retention.check.interval.ms
> > > > > > > >> > > > > > > > > > > > > which is applicable for remote
> > storage.
> > > > > Hence,
> > > > > > > the
> > > > > > > >> > > metric
> > > > > > > >> > > > > > will
> > > > > > > >> > > > > > > be
> > > > > > > >> > > > > > > > > > > updated
> > > > > > > >> > > > > > > > > > > > > at the time of invoking log
> retention
> > > check
> > > > > > for
> > > > > > > >> > remote
> > > > > > > >> > > > tier
> > > > > > > >> > > > > > > which
> > > > > > > >> > > > > > > > > is
> > > > > > > >> > > > > > > > > > > > > pending implementation today. We can
> > > perhaps
> > > > > > > come
> > > > > > > >> > back
> > > > > > > >> > > > and
> > > > > > > >> > > > > > > update
> > > > > > > >> > > > > > > > > the
> > > > > > > >> > > > > > > > > > > > > metric description after the
> > > implementation
> > > > > of
> > > > > > > log
> > > > > > > >> > > > > retention
> > > > > > > >> > > > > > > > check
> > > > > > > >> > > > > > > > > in
> > > > > > > >> > > > > > > > > > > > > RemoteLogManager.
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > --
> > > > > > > >> > > > > > > > > > > > > Divij Vaidya
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > On Thu, Nov 10, 2022 at 6:16 AM Luke
> > > Chen <
> > > > > > > >> > > > > showuon@gmail.com
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > > > wrote:
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > Hi Divij,
> > > > > > > >> > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > One more question about the
> metric:
> > > > > > > >> > > > > > > > > > > > > > I think the metric will be updated
> > > when
> > > > > > > >> > > > > > > > > > > > > > (1) each time we run the log
> > retention
> > > > > check
> > > > > > > >> (that
> > > > > > > >> > > is,
> > > > > > > >> > > > > > > > > > > > > > log.retention.check.interval.ms)
> > > > > > > >> > > > > > > > > > > > > > (2) When user explicitly call
> > > > > > getRemoteLogSize
> > > > > > > >> > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > Is that correct?
> > > > > > > >> > > > > > > > > > > > > > Maybe we should add a note in
> metric
> > > > > > > >> description,
> > > > > > > >> > > > > > otherwise,
> > > > > > > >> > > > > > > > when
> > > > > > > >> > > > > > > > > > > user
> > > > > > > >> > > > > > > > > > > > > got,
> > > > > > > >> > > > > > > > > > > > > > let's say 0 of RemoteLogSizeBytes,
> > > will be
> > > > > > > >> > surprised.
> > > > > > > >> > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > Otherwise, LGTM
> > > > > > > >> > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > Thank you for the KIP
> > > > > > > >> > > > > > > > > > > > > > Luke
> > > > > > > >> > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > On Thu, Nov 10, 2022 at 2:55 AM
> Jun
> > > Rao
> > > > > > > >> > > > > > > > <jun@confluent.io.invalid
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > wrote:
> > > > > > > >> > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > Hi, Divij,
> > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > Thanks for the explanation.
> > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > 1. Hmm, the default
> implementation
> > > of
> > > > > RLMM
> > > > > > > >> does
> > > > > > > >> > > local
> > > > > > > >> > > > > > > > caching,
> > > > > > > >> > > > > > > > > > > right?
> > > > > > > >> > > > > > > > > > > > > > > Currently, we also cache all
> > segment
> > > > > > > metadata
> > > > > > > >> in
> > > > > > > >> > > the
> > > > > > > >> > > > > > > brokers
> > > > > > > >> > > > > > > > > > > without
> > > > > > > >> > > > > > > > > > > > > > > KIP-405. Do you see a need to
> > change
> > > > > that?
> > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > 2,3,4: Yes, your explanation
> makes
> > > > > sense.
> > > > > > > >> > However,
> > > > > > > >> > > > > > > > > > > > > > > currently,
> > > > > > > >> > > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > >> > > > > > > > > doesn't
> > > > > > > >> > > > > > > > > > > > > > specify
> > > > > > > >> > > > > > > > > > > > > > > a particular order of the
> > iterator.
> > > Do
> > > > > you
> > > > > > > >> intend
> > > > > > > >> > > to
> > > > > > > >> > > > > > change
> > > > > > > >> > > > > > > > > that?
> > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > Thanks,
> > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > Jun
> > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 3:31 AM
> > Divij
> > > > > > Vaidya
> > > > > > > <
> > > > > > > >> > > > > > > > > > > divijvaidya13@gmail.com
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > wrote:
> > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > Hey Jun
> > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > Thank you for your comments.
> > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > *1. "RLMM implementor could
> > ensure
> > > > > that
> > > > > > > >> > > > > > > > > listRemoteLogSegments()
> > > > > > > >> > > > > > > > > > > is
> > > > > > > >> > > > > > > > > > > > > > fast"*
> > > > > > > >> > > > > > > > > > > > > > > > This would be ideal but
> > > pragmatically,
> > > > > > it
> > > > > > > is
> > > > > > > >> > > > > difficult
> > > > > > > >> > > > > > to
> > > > > > > >> > > > > > > > > ensure
> > > > > > > >> > > > > > > > > > > > that
> > > > > > > >> > > > > > > > > > > > > > > > listRemoteLogSegments() is
> fast.
> > > This
> > > > > is
> > > > > > > >> > because
> > > > > > > >> > > of
> > > > > > > >> > > > > the
> > > > > > > >> > > > > > > > > > > possibility
> > > > > > > >> > > > > > > > > > > > > of
> > > > > > > >> > > > > > > > > > > > > > a
> > > > > > > >> > > > > > > > > > > > > > > > large number of segments (much
> > > larger
> > > > > > than
> > > > > > > >> what
> > > > > > > >> > > > Kafka
> > > > > > > >> > > > > > > > > currently
> > > > > > > >> > > > > > > > > > > > > handles
> > > > > > > >> > > > > > > > > > > > > > > > with local storage today)
> would
> > > make
> > > > > it
> > > > > > > >> > > infeasible
> > > > > > > >> > > > to
> > > > > > > >> > > > > > > adopt
> > > > > > > >> > > > > > > > > > > > > strategies
> > > > > > > >> > > > > > > > > > > > > > > such
> > > > > > > >> > > > > > > > > > > > > > > > as local caching to improve
> the
> > > > > > > performance
> > > > > > > >> of
> > > > > > > >> > > > > > > > > > > > listRemoteLogSegments.
> > > > > > > >> > > > > > > > > > > > > > > Apart
> > > > > > > >> > > > > > > > > > > > > > > > from caching (which won't work
> > > due to
> > > > > > size
> > > > > > > >> > > > > > limitations) I
> > > > > > > >> > > > > > > > > can't
> > > > > > > >> > > > > > > > > > > > think
> > > > > > > >> > > > > > > > > > > > > > of
> > > > > > > >> > > > > > > > > > > > > > > > other strategies which may
> > > eliminate
> > > > > the
> > > > > > > >> need
> > > > > > > >> > for
> > > > > > > >> > > > IO
> > > > > > > >> > > > > > > > > > > > > > > > operations proportional to the
> > > number
> > > > > of
> > > > > > > >> total
> > > > > > > >> > > > > > segments.
> > > > > > > >> > > > > > > > > Please
> > > > > > > >> > > > > > > > > > > > > advise
> > > > > > > >> > > > > > > > > > > > > > if
> > > > > > > >> > > > > > > > > > > > > > > > you have something in mind.
> > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > 2.  "*If the size exceeds the
> > > > > retention
> > > > > > > >> size,
> > > > > > > >> > we
> > > > > > > >> > > > need
> > > > > > > >> > > > > > to
> > > > > > > >> > > > > > > > > > > determine
> > > > > > > >> > > > > > > > > > > > > the
> > > > > > > >> > > > > > > > > > > > > > > > subset of segments to delete
> to
> > > bring
> > > > > > the
> > > > > > > >> size
> > > > > > > >> > > > within
> > > > > > > >> > > > > > the
> > > > > > > >> > > > > > > > > > > retention
> > > > > > > >> > > > > > > > > > > > > > > limit.
> > > > > > > >> > > > > > > > > > > > > > > > Do we need to call
> > > > > > > >> > > > > > > > > > >
> > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > >> > > > > > > > > > > > > to
> > > > > > > >> > > > > > > > > > > > > > > > determine that?"*
> > > > > > > >> > > > > > > > > > > > > > > > Yes, we need to call
> > > > > > > >> listRemoteLogSegments() to
> > > > > > > >> > > > > > determine
> > > > > > > >> > > > > > > > > which
> > > > > > > >> > > > > > > > > > > > > > segments
> > > > > > > >> > > > > > > > > > > > > > > > should be deleted. But there
> is
> > a
> > > > > > > difference
> > > > > > > >> > with
> > > > > > > >> > > > the
> > > > > > > >> > > > > > use
> > > > > > > >> > > > > > > > > case we
> > > > > > > >> > > > > > > > > > > > are
> > > > > > > >> > > > > > > > > > > > > > > > trying to optimize with this
> > KIP.
> > > To
> > > > > > > >> determine
> > > > > > > >> > > the
> > > > > > > >> > > > > > subset
> > > > > > > >> > > > > > > > of
> > > > > > > >> > > > > > > > > > > > segments
> > > > > > > >> > > > > > > > > > > > > > > which
> > > > > > > >> > > > > > > > > > > > > > > > would be deleted, we only read
> > > > > metadata
> > > > > > > for
> > > > > > > >> > > > segments
> > > > > > > >> > > > > > > which
> > > > > > > >> > > > > > > > > would
> > > > > > > >> > > > > > > > > > > be
> > > > > > > >> > > > > > > > > > > > > > > deleted
> > > > > > > >> > > > > > > > > > > > > > > > via the
> listRemoteLogSegments().
> > > But
> > > > > to
> > > > > > > >> > determine
> > > > > > > >> > > > the
> > > > > > > >> > > > > > > > > > > totalLogSize,
> > > > > > > >> > > > > > > > > > > > > > which
> > > > > > > >> > > > > > > > > > > > > > > > is required every time
> retention
> > > logic
> > > > > > > >> based on
> > > > > > > >> > > > size
> > > > > > > >> > > > > > > > > executes, we
> > > > > > > >> > > > > > > > > > > > > read
> > > > > > > >> > > > > > > > > > > > > > > > metadata of *all* the segments
> > in
> > > > > remote
> > > > > > > >> > storage.
> > > > > > > >> > > > > > Hence,
> > > > > > > >> > > > > > > > the
> > > > > > > >> > > > > > > > > > > number
> > > > > > > >> > > > > > > > > > > > > of
> > > > > > > >> > > > > > > > > > > > > > > > results returned by
> > > > > > > >> > > > > > > > > > > >
> > > > > > *RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > >> > > > > > > > > > > > > > *is
> > > > > > > >> > > > > > > > > > > > > > > > different when we are
> > calculating
> > > > > > > >> totalLogSize
> > > > > > > >> > > vs.
> > > > > > > >> > > > > when
> > > > > > > >> > > > > > > we
> > > > > > > >> > > > > > > > > are
> > > > > > > >> > > > > > > > > > > > > > > determining
> > > > > > > >> > > > > > > > > > > > > > > > the subset of segments to
> > delete.
> > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > 3.
> > > > > > > >> > > > > > > > > > > > > > > > *"Also, what about time-based
> > > > > retention?
> > > > > > > To
> > > > > > > >> > make
> > > > > > > >> > > > that
> > > > > > > >> > > > > > > > > efficient,
> > > > > > > >> > > > > > > > > > > do
> > > > > > > >> > > > > > > > > > > > > we
> > > > > > > >> > > > > > > > > > > > > > > need
> > > > > > > >> > > > > > > > > > > > > > > > to make some additional
> > interface
> > > > > > > > >> changes?"* No.
> > > > > > > >> > > > Note
> > > > > > > >> > > > > > that
> > > > > > > >> > > > > > > > > time
> > > > > > > >> > > > > > > > > > > > > > complexity
> > > > > > > >> > > > > > > > > > > > > > > > to determine the segments for
> > > > > retention
> > > > > > is
> > > > > > > >> > > > different
> > > > > > > >> > > > > > for
> > > > > > > >> > > > > > > > time
> > > > > > > >> > > > > > > > > > > based
> > > > > > > >> > > > > > > > > > > > > vs.
> > > > > > > >> > > > > > > > > > > > > > > > size based. For time based,
> the
> > > time
> > > > > > > >> complexity
> > > > > > > >> > > is
> > > > > > > >> > > > a
> > > > > > > >> > > > > > > > > function of
> > > > > > > >> > > > > > > > > > > > the
> > > > > > > >> > > > > > > > > > > > > > > number
> > > > > > > >> > > > > > > > > > > > > > > > of segments which are
> "eligible
> > > for
> > > > > > > >> deletion"
> > > > > > > >> > > > (since
> > > > > > > >> > > > > we
> > > > > > > >> > > > > > > > only
> > > > > > > >> > > > > > > > > read
> > > > > > > >> > > > > > > > > > > > > > > metadata
> > > > > > > >> > > > > > > > > > > > > > > > for segments which would be
> > > deleted)
> > > > > > > >> whereas in
> > > > > > > >> > > > size
> > > > > > > >> > > > > > > based
> > > > > > > >> > > > > > > > > > > > retention,
> > > > > > > >> > > > > > > > > > > > > > the
> > > > > > > >> > > > > > > > > > > > > > > > time complexity is a function
> of
> > > "all
> > > > > > > >> segments"
> > > > > > > >> > > > > > available
> > > > > > > >> > > > > > > > in
> > > > > > > >> > > > > > > > > > > remote
> > > > > > > >> > > > > > > > > > > > > > > storage
> > > > > > > >> > > > > > > > > > > > > > > > (metadata of all segments
> needs
> > > to be
> > > > > > read
> > > > > > > >> to
> > > > > > > >> > > > > calculate
> > > > > > > >> > > > > > > the
> > > > > > > >> > > > > > > > > total
> > > > > > > >> > > > > > > > > > > > > > size).
> > > > > > > >> > > > > > > > > > > > > > > As
> > > > > > > >> > > > > > > > > > > > > > > > you may observe, this KIP will
> > > bring
> > > > > the
> > > > > > > >> time
> > > > > > > >> > > > > > complexity
> > > > > > > >> > > > > > > > for
> > > > > > > >> > > > > > > > > both
> > > > > > > >> > > > > > > > > > > > > time
> > > > > > > >> > > > > > > > > > > > > > > > based retention & size based
> > > retention
> > > > > > to
> > > > > > > >> the
> > > > > > > >> > > same
> > > > > > > >> > > > > > > > function.
> > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > 4. Also, please note that this
> > > new API
> > > > > > > >> > introduced
> > > > > > > >> > > > in
> > > > > > > >> > > > > > this
> > > > > > > >> > > > > > > > KIP
> > > > > > > >> > > > > > > > > > > also
> > > > > > > >> > > > > > > > > > > > > > > enables
> > > > > > > >> > > > > > > > > > > > > > > > us to provide a metric for
> total
> > > size
> > > > > of
> > > > > > > >> data
> > > > > > > >> > > > stored
> > > > > > > >> > > > > in
> > > > > > > >> > > > > > > > > remote
> > > > > > > >> > > > > > > > > > > > > storage.
> > > > > > > >> > > > > > > > > > > > > > > > Without the API, calculation
> of
> > > this
> > > > > > > metric
> > > > > > > >> > will
> > > > > > > >> > > > > become
> > > > > > > >> > > > > > > > very
> > > > > > > >> > > > > > > > > > > > > expensive
> > > > > > > >> > > > > > > > > > > > > > > with
> > > > > > > >> > > > > > > > > > > > > > > > *listRemoteLogSegments().*
> > > > > > > >> > > > > > > > > > > > > > > > I understand that your
> > motivation
> > > here
> > > > > > is
> > > > > > > to
> > > > > > > >> > > avoid
> > > > > > > >> > > > > > > > polluting
> > > > > > > >> > > > > > > > > the
> > > > > > > >> > > > > > > > > > > > > > > interface
> > > > > > > >> > > > > > > > > > > > > > > > with optimization specific
> APIs
> > > and I
> > > > > > will
> > > > > > > >> > agree
> > > > > > > >> > > > with
> > > > > > > >> > > > > > > that
> > > > > > > >> > > > > > > > > goal.
> > > > > > > >> > > > > > > > > > > > But
> > > > > > > >> > > > > > > > > > > > > I
> > > > > > > >> > > > > > > > > > > > > > > > believe that this new API
> > > proposed in
> > > > > > the
> > > > > > > >> KIP
> > > > > > > >> > > > brings
> > > > > > > >> > > > > in
> > > > > > > >> > > > > > > > > > > significant
> > > > > > > >> > > > > > > > > > > > > > > > improvement and there is no
> > other
> > > work
> > > > > > > >> around
> > > > > > > >> > > > > available
> > > > > > > >> > > > > > > to
> > > > > > > >> > > > > > > > > > > achieve
> > > > > > > >> > > > > > > > > > > > > the
> > > > > > > >> > > > > > > > > > > > > > > same
> > > > > > > >> > > > > > > > > > > > > > > > performance.
> > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > Regards,
> > > > > > > >> > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 12:12
> AM
> > > Jun
> > > > > Rao
> > > > > > > >> > > > > > > > > <jun@confluent.io.invalid
> > > > > > > >> > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > wrote:
> > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > Hi, Divij,
> > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > Thanks for the KIP. Sorry
> for
> > > the
> > > > > late
> > > > > > > >> reply.
> > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > The motivation of the KIP is
> > to
> > > > > > improve
> > > > > > > >> the
> > > > > > > >> > > > > > efficiency
> > > > > > > >> > > > > > > of
> > > > > > > >> > > > > > > > > size
> > > > > > > >> > > > > > > > > > > > > based
> > > > > > > >> > > > > > > > > > > > > > > > > retention. I am not sure the
> > > > > proposed
> > > > > > > >> changes
> > > > > > > >> > > are
> > > > > > > >> > > > > > > enough.
> > > > > > > >> > > > > > > > > For
> > > > > > > >> > > > > > > > > > > > > > example,
> > > > > > > >> > > > > > > > > > > > > > > if
> > > > > > > >> > > > > > > > > > > > > > > > > the size exceeds the
> retention
> > > size,
> > > > > > we
> > > > > > > >> need
> > > > > > > >> > to
> > > > > > > >> > > > > > > determine
> > > > > > > >> > > > > > > > > the
> > > > > > > >> > > > > > > > > > > > > subset
> > > > > > > >> > > > > > > > > > > > > > of
> > > > > > > >> > > > > > > > > > > > > > > > > segments to delete to bring
> > the
> > > size
> > > > > > > >> within
> > > > > > > >> > the
> > > > > > > >> > > > > > > retention
> > > > > > > >> > > > > > > > > > > limit.
> > > > > > > >> > > > > > > > > > > > Do
> > > > > > > >> > > > > > > > > > > > > > we
> > > > > > > >> > > > > > > > > > > > > > > > need
> > > > > > > >> > > > > > > > > > > > > > > > > to call
> > > > > > > >> > > > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > >> > > > > > > > to
> > > > > > > >> > > > > > > > > > > > > determine
> > > > > > > >> > > > > > > > > > > > > > > > that?
> > > > > > > >> > > > > > > > > > > > > > > > > Also, what about time-based
> > > > > retention?
> > > > > > > To
> > > > > > > >> > make
> > > > > > > >> > > > that
> > > > > > > >> > > > > > > > > efficient,
> > > > > > > >> > > > > > > > > > > do
> > > > > > > >> > > > > > > > > > > > > we
> > > > > > > >> > > > > > > > > > > > > > > need
> > > > > > > >> > > > > > > > > > > > > > > > > to make some additional
> > > interface
> > > > > > > changes?
> > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > An alternative approach is
> for
> > > the
> > > > > > RLMM
> > > > > > > >> > > > implementor
> > > > > > > >> > > > > > to
> > > > > > > >> > > > > > > > make
> > > > > > > >> > > > > > > > > > > sure
> > > > > > > >> > > > > > > > > > > > > > > > > that
> > > > > > > >> > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > >> > > > > > > is
> > > > > > > >> > > > > > > > > fast
> > > > > > > >> > > > > > > > > > > > > (e.g.,
> > > > > > > >> > > > > > > > > > > > > > > with
> > > > > > > >> > > > > > > > > > > > > > > > > local caching). This way, we
> > > could
> > > > > > keep
> > > > > > > >> the
> > > > > > > >> > > > > interface
> > > > > > > >> > > > > > > > > simple.
> > > > > > > >> > > > > > > > > > > > Have
> > > > > > > >> > > > > > > > > > > > > we
> > > > > > > >> > > > > > > > > > > > > > > > > considered that?
> > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > Thanks,
> > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > Jun
> > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > On Wed, Sep 28, 2022 at 6:28
> > AM
> > > > > Divij
> > > > > > > >> Vaidya
> > > > > > > >> > <
> > > > > > > >> > > > > > > > > > > > > > divijvaidya13@gmail.com>
> > > > > > > >> > > > > > > > > > > > > > > > > wrote:
> > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > Hey folks
> > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > Does anyone else have any
> > > thoughts
> > > > > > on
> > > > > > > >> this
> > > > > > > >> > > > > before I
> > > > > > > >> > > > > > > > > propose
> > > > > > > >> > > > > > > > > > > > this
> > > > > > > >> > > > > > > > > > > > > > for
> > > > > > > >> > > > > > > > > > > > > > > a
> > > > > > > >> > > > > > > > > > > > > > > > > > vote?
> > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > --
> > > > > > > >> > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > On Mon, Sep 5, 2022 at
> 12:57
> > > PM
> > > > > > Satish
> > > > > > > >> > > Duggana
> > > > > > > >> > > > <
> > > > > > > >> > > > > > > > > > > > > > > > satish.duggana@gmail.com
> > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > wrote:
> > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > Thanks for the KIP
> Divij!
> > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > This is a nice
> improvement
> > > to
> > > > > > avoid
> > > > > > > >> > > > > recalculation
> > > > > > > >> > > > > > > of
> > > > > > > >> > > > > > > > > size.
> > > > > > > >> > > > > > > > > > > > > > > Customized
> > > > > > > >> > > > > > > > > > > > > > > > > > RLMMs
> > > > > > > >> > > > > > > > > > > > > > > > > > > can implement the best
> > > possible
> > > > > > > >> approach
> > > > > > > >> > by
> > > > > > > >> > > > > > caching
> > > > > > > >> > > > > > > > or
> > > > > > > >> > > > > > > > > > > > > > maintaining
> > > > > > > >> > > > > > > > > > > > > > > > the
> > > > > > > >> > > > > > > > > > > > > > > > > > size
> > > > > > > >> > > > > > > > > > > > > > > > > > > in an efficient way. But
> > > this is
> > > > > > > not a
> > > > > > > >> > big
> > > > > > > >> > > > > > concern
> > > > > > > >> > > > > > > > for
> > > > > > > >> > > > > > > > > the
> > > > > > > >> > > > > > > > > > > > > > default
> > > > > > > >> > > > > > > > > > > > > > > > > topic
> > > > > > > >> > > > > > > > > > > > > > > > > > > based RLMM as mentioned
> in
> > > the
> > > > > > KIP.
> > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > ~Satish.
> > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > On Wed, 13 Jul 2022 at
> > > 18:48,
> > > > > > Divij
> > > > > > > >> > Vaidya
> > > > > > > >> > > <
> > > > > > > >> > > > > > > > > > > > > > > divijvaidya13@gmail.com>
> > > > > > > >> > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > Thank you for your
> > review
> > > > > Luke.
> > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > Reg: is that would
> the
> > > new
> > > > > > > >> > > > > > `RemoteLogSizeBytes`
> > > > > > > >> > > > > > > > > metric
> > > > > > > >> > > > > > > > > > > > be a
> > > > > > > >> > > > > > > > > > > > > > > > > > performance
> > > > > > > >> > > > > > > > > > > > > > > > > > > > overhead? Although we
> > > move the
> > > > > > > >> > > calculation
> > > > > > > >> > > > > to a
> > > > > > > > >> > > > > > > > > separate
> > > > > > > >> > > > > > > > > > > > API,
> > > > > > > >> > > > > > > > > > > > > > we
> > > > > > > >> > > > > > > > > > > > > > > > > still
> > > > > > > >> > > > > > > > > > > > > > > > > > > > can't assume users
> will
> > > > > > implement
> > > > > > > a
> > > > > > > >> > > > > > light-weight
> > > > > > > >> > > > > > > > > method,
> > > > > > > >> > > > > > > > > > > > > right?
> > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > This metric would be
> > > logged
> > > > > > using
> > > > > > > >> the
> > > > > > > >> > > > > > information
> > > > > > > >> > > > > > > > > that is
> > > > > > > >> > > > > > > > > > > > > > already
> > > > > > > >> > > > > > > > > > > > > > > > > being
> > > > > > > >> > > > > > > > > > > > > > > > > > > > calculated for
> handling
> > > remote
> > > > > > > >> > retention
> > > > > > > >> > > > > logic,
> > > > > > > >> > > > > > > > > hence, no
> > > > > > > >> > > > > > > > > > > > > > > > additional
> > > > > > > >> > > > > > > > > > > > > > > > > > work
> > > > > > > >> > > > > > > > > > > > > > > > > > > > is required to
> calculate
> > > this
> > > > > > > >> metric.
> > > > > > > >> > > More
> > > > > > > >> > > > > > > > > specifically,
> > > > > > > >> > > > > > > > > > > > > > whenever
> > > > > > > >> > > > > > > > > > > > > > > > > > > > RemoteLogManager calls
> > > > > > > >> getRemoteLogSize
> > > > > > > >> > > > API,
> > > > > > > >> > > > > > this
> > > > > > > >> > > > > > > > > metric
> > > > > > > >> > > > > > > > > > > > > would
> > > > > > > >> > > > > > > > > > > > > > be
> > > > > > > >> > > > > > > > > > > > > > > > > > > captured.
> > > > > > > >> > > > > > > > > > > > > > > > > > > > This API call is made
> > > every
> > > > > time
> > > > > > > >> > > > > > RemoteLogManager
> > > > > > > >> > > > > > > > > wants
> > > > > > > >> > > > > > > > > > > to
> > > > > > > >> > > > > > > > > > > > > > handle
> > > > > > > >> > > > > > > > > > > > > > > > > > expired
> > > > > > > >> > > > > > > > > > > > > > > > > > > > remote log segments
> > (which
> > > > > > should
> > > > > > > be
> > > > > > > >> > > > > periodic).
> > > > > > > >> > > > > > > > Does
> > > > > > > >> > > > > > > > > that
> > > > > > > >> > > > > > > > > > > > > > address
> > > > > > > >> > > > > > > > > > > > > > > > > your
> > > > > > > >> > > > > > > > > > > > > > > > > > > > concern?
> > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > On Tue, Jul 12, 2022
> at
> > > 11:01
> > > > > AM
> > > > > > > >> Luke
> > > > > > > >> > > Chen
> > > > > > > >> > > > <
> > > > > > > >> > > > > > > > > > > > > showuon@gmail.com>
> > > > > > > >> > > > > > > > > > > > > > > > > wrote:
> > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > Hi Divij,
> > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > Thanks for the KIP!
> > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > I think it makes
> sense
> > > to
> > > > > > > delegate
> > > > > > > >> > the
> > > > > > > >> > > > > > > > > responsibility
> > > > > > > >> > > > > > > > > > > of
> > > > > > > >> > > > > > > > > > > > > > > > > calculation
> > > > > > > >> > > > > > > > > > > > > > > > > > to
> > > > > > > >> > > > > > > > > > > > > > > > > > > > the
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > specific
> > > > > > > RemoteLogMetadataManager
> > > > > > > >> > > > > > > implementation.
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > But one thing I'm
> not
> > > quite
> > > > > > > sure,
> > > > > > > >> is
> > > > > > > >> > > that
> > > > > > > >> > > > > > would
> > > > > > > >> > > > > > > > > the new
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > `RemoteLogSizeBytes`
> > > metric
> > > > > > be a
> > > > > > > >> > > > > performance
> > > > > > > >> > > > > > > > > overhead?
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > Although we move the
> > > > > > calculation
> > > > > > > >> to a
> > > > > > > > >> > > > > > separate
> > > > > > > >> > > > > > > > > API, we
> > > > > > > >> > > > > > > > > > > > > still
> > > > > > > >> > > > > > > > > > > > > > > > can't
> > > > > > > >> > > > > > > > > > > > > > > > > > > assume
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > users will
> implement a
> > > > > > > >> light-weight
> > > > > > > >> > > > method,
> > > > > > > >> > > > > > > > right?
> > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > Thank you.
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > Luke
> > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > On Fri, Jul 1, 2022
> at
> > > 5:47
> > > > > PM
> > > > > > > >> Divij
> > > > > > > >> > > > > Vaidya <
> > > > > > > >> > > > > > > > > > > > > > > > > divijvaidya13@gmail.com
> > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > >
> > > > > > > >> > > > >
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-852%3A+Optimize+calculation+of+size+for+log+in+remote+tier
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Hey folks
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Please take a look
> > at
> > > this
> > > > > > KIP
> > > > > > > >> > which
> > > > > > > >> > > > > > proposes
> > > > > > > >> > > > > > > > an
> > > > > > > >> > > > > > > > > > > > > extension
> > > > > > > >> > > > > > > > > > > > > > to
> > > > > > > >> > > > > > > > > > > > > > > > > > > KIP-405.
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > This
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > is my first KIP
> with
> > > > > Apache
> > > > > > > >> Kafka
> > > > > > > >> > > > > community
> > > > > > > >> > > > > > > so
> > > > > > > >> > > > > > > > > any
> > > > > > > >> > > > > > > > > > > > > feedback
> > > > > > > >> > > > > > > > > > > > > > > > would
> > > > > > > >> > > > > > > > > > > > > > > > > > be
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > highly
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > appreciated.
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Cheers!
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Sr. Software
> > Engineer
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Amazon
> > > > > > > >> > > > > > > > > > > > > > > > > > > > > >

Re: [DISCUSS] KIP-852 Optimize calculation of size for log in remote tier

Posted by Divij Vaidya <di...@gmail.com>.
Thank you folks for reviewing this KIP.

Satish, I have modified the motivation to make it clearer. It now reads:
"Since the main feature of tiered storage is storing a large amount of
data, we expect num_remote_segments to be large. A frequent linear scan
(i.e. listing all segment metadata) could be expensive/slower because of
the underlying storage used by RemoteLogMetadataManager. This slowness to
list all segment metadata could result in the loss of availability...."
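
To make the cost difference concrete, here is a minimal sketch in Python. It
is not Kafka code: SegmentMetadata, ScanOnlyStore, IndexedStore and their
methods are hypothetical stand-ins for an RLMM implementation, used only to
contrast a size computed by listing every segment with a size served from a
running total that the plugin maintains itself.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class SegmentMetadata:
    """Hypothetical stand-in for the metadata of one remote log segment."""
    base_offset: int
    size_bytes: int


class ScanOnlyStore:
    """Models a metadata store that only offers a listing call."""

    def __init__(self, segments: Iterable[SegmentMetadata]) -> None:
        self._segments: List[SegmentMetadata] = list(segments)
        self.records_read = 0  # counts metadata records handed to the caller

    def list_segments(self) -> Iterator[SegmentMetadata]:
        for segment in self._segments:
            self.records_read += 1
            yield segment


def remote_log_size_by_scan(store: ScanOnlyStore) -> int:
    """Linear scan: reads every segment's metadata on each call."""
    return sum(segment.size_bytes for segment in store.list_segments())


class IndexedStore(ScanOnlyStore):
    """Models a store that also keeps a running total of segment sizes."""

    def __init__(self, segments: Iterable[SegmentMetadata]) -> None:
        super().__init__(segments)
        # Assumption: the backing store can build and persist this total
        # itself, so the caller never triggers a listing to obtain it.
        self._total_bytes = sum(s.size_bytes for s in self._segments)

    def add_segment(self, segment: SegmentMetadata) -> None:
        self._segments.append(segment)
        self._total_bytes += segment.size_bytes

    def remote_log_size(self) -> int:
        """Constant-time lookup: no per-segment metadata is read."""
        return self._total_bytes
```

With num_remote_segments entries, every call to remote_log_size_by_scan()
reads num_remote_segments metadata records, whereas remote_log_size() reads
none. How a real plugin would maintain such a total (an index, a counter in a
database, and so on) is left to the implementation.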

Jun, Kamal, Satish, if you don't have any further concerns, I would
appreciate a vote for this KIP in the voting thread -
https://lists.apache.org/thread/soz00990gvzodv7oyqj4ysvktrqy6xfk

--
Divij Vaidya



On Sat, Jul 1, 2023 at 6:16 AM Kamal Chandraprakash <
kamal.chandraprakash@gmail.com> wrote:

> Hi Divij,
>
> Thanks for the explanation. LGTM.
>
> --
> Kamal
>
> On Sat, Jul 1, 2023 at 7:28 AM Satish Duggana <sa...@gmail.com>
> wrote:
>
> > Hi Divij,
> > I am fine with having an API to compute the size as I mentioned in my
> > earlier reply in this mail thread. But I have the below comment for
> > the motivation for this KIP.
> >
> > As you discussed offline, the main issue here is listing calls for
> > remote log segment metadata is slower because of the storage used for
> > RLMM. These can be avoided with this new API.
> >
> > Please add this in the motivation section as it is one of the main
> > motivations for the KIP.
> >
> > Thanks,
> > Satish.
> >
> > On Sat, 1 Jul 2023 at 01:43, Jun Rao <ju...@confluent.io.invalid> wrote:
> > >
> > > Hi, Divij,
> > >
> > > Sorry for the late reply.
> > >
> > > Given your explanation, the new API sounds reasonable to me. Is that
> > enough
> > > to build the external metadata layer for the remote segments or do you
> > need
> > > some additional API changes?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Fri, Jun 9, 2023 at 7:08 AM Divij Vaidya <di...@gmail.com>
> > wrote:
> > >
> > > > Thank you for looking into this Kamal.
> > > >
> > > > You are right in saying that a cold start (i.e. leadership failover
> or
> > > > broker startup) does not impact the broker startup duration. But it
> > does
> > > > have the following impact:
> > > > 1. It leads to a burst of full-scan requests to RLMM in case multiple
> > > > leadership failovers occur at the same time. Even if the RLMM
> > > > implementation has the capability to serve the total size from an
> index
> > > > (and hence handle this burst), we wouldn't be able to use it since
> the
> > > > current API necessarily calls for a full scan.
> > > > 2. The archival (copying of data to tiered storage) process will
> have a
> > > > delayed start. The delayed start of archival could lead to local
> build
> > up
> > > > of data which may lead to disk full.
> > > >
> > > > The disadvantage of adding this new API is that every provider will
> > have to
> > > > implement it, agreed. But I believe that this tradeoff is worthwhile
> > since
> > > > the default implementation could be the same as you mentioned, i.e.
> > keeping
> > > > cumulative in-memory count.
> > > >
> > > > --
> > > > Divij Vaidya
> > > >
> > > >
> > > >
> > > > On Sun, Jun 4, 2023 at 5:48 PM Kamal Chandraprakash <
> > > > kamal.chandraprakash@gmail.com> wrote:
> > > >
> > > > > Hi Divij,
> > > > >
> > > > > Thanks for the KIP! Sorry for the late reply.
> > > > >
> > > > > Can you explain the rejected alternative-3?
> > > > > Store the cumulative size of remote tier log in-memory at
> > > > RemoteLogManager
> > > > > "*Cons*: Every time a broker starts-up, it will scan through all
> the
> > > > > segments in the remote tier to initialise the in-memory value. This
> > would
> > > > > increase the broker start-up time."
> > > > >
> > > > > Keeping the source of truth to determine the remote-log-size in the
> > > > leader
> > > > > would be consistent across different implementations of the plugin.
> > The
> > > > > concern posted in the KIP is that we are calculating the
> > remote-log-size
> > > > on
> > > > > each iteration of the cleaner thread (say 5 mins). If we calculate
> > only
> > > > > once during broker startup or during the leadership reassignment,
> do
> > we
> > > > > still need the cache?
> > > > >
> > > > > The broker startup-time won't be affected by the remote log manager
> > > > > > initialisation. The broker continues to accept the new
> > > > > produce/fetch requests, while the RLM thread in the background can
> > > > > determine the remote-log-size once and start copying/deleting the
> > > > segments.
> > > > >
> > > > > Thanks,
> > > > > Kamal
> > > > >
> > > > > On Thu, Jun 1, 2023 at 2:08 PM Divij Vaidya <
> divijvaidya13@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > Satish / Jun
> > > > > >
> > > > > > Do you have any thoughts on this?
> > > > > >
> > > > > > --
> > > > > > Divij Vaidya
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Feb 14, 2023 at 4:15 PM Divij Vaidya <
> > divijvaidya13@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hey Jun
> > > > > > >
> > > > > > > It has been a while since this KIP got some attention. While we
> > wait
> > > > > for
> > > > > > > Satish to chime in here, perhaps I can answer your question.
> > > > > > >
> > > > > > > > Could you explain how you exposed the log size in your
> KIP-405
> > > > > > > implementation?
> > > > > > >
> > > > > > > The APIs available in RLMM as per KIP405
> > > > > > > are, addRemoteLogSegmentMetadata(),
> > updateRemoteLogSegmentMetadata(),
> > > > > > remoteLogSegmentMetadata(), highestOffsetForEpoch(),
> > > > > > putRemotePartitionDeleteMetadata(), listRemoteLogSegments(),
> > > > > > onPartitionLeadershipChanges()
> > > > > > > and onStopPartitions(). None of these APIs allow us to expose
> > the log
> > > > > > size,
> > > > > > > hence, the only option that remains is to list all segments
> using
> > > > > > > listRemoteLogSegments() and aggregate them every time we
> require
> > to
> > > > > > > calculate the size. Based on our prior discussion, this
> requires
> > > > > reading
> > > > > > > all segment metadata which won't work for non-local RLMM
> > > > > implementations.
> > > > > > > Satish's implementation also performs a full scan and
> calculates
> > the
> > > > > > > aggregate. see:
> > > > > > >
> https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/remote/RemoteLogManager.scala#L619
> > > > > > >
> > > > > > >
> > > > > > > Does this answer your question?
> > > > > > >
> > > > > > > --
> > > > > > > Divij Vaidya
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Dec 20, 2022 at 8:40 PM Jun Rao
> <jun@confluent.io.invalid
> > >
> > > > > > wrote:
> > > > > > >
> > > > > > >> Hi, Divij,
> > > > > > >>
> > > > > > >> Thanks for the explanation.
> > > > > > >>
> > > > > > >> Good question.
> > > > > > >>
> > > > > > >> Hi, Satish,
> > > > > > >>
> > > > > > >> Could you explain how you exposed the log size in your KIP-405
> > > > > > >> implementation?
> > > > > > >>
> > > > > > >> Thanks,
> > > > > > >>
> > > > > > >> Jun
> > > > > > >>
> > > > > > >> On Tue, Dec 20, 2022 at 4:59 AM Divij Vaidya <
> > > > divijvaidya13@gmail.com
> > > > > >
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >> > Hey Jun
> > > > > > >> >
> > > > > > >> > Yes, it is possible to maintain the log size in the cache
> (see
> > > > > > rejected
> > > > > > >> > alternative#3 in the KIP) but I did not understand how it is
> > > > > possible
> > > > > > to
> > > > > > >> > retrieve it without the new API. The log size could be
> > calculated
> > > > on
> > > > > > >> > startup by scanning through the segments (though I would
> > disagree
> > > > > that
> > > > > > >> this
> > > > > > >> > is the right approach since scanning itself takes order of
> > minutes
> > > > > and
> > > > > > > >> > hence delays the start of archive process), and incrementally
> > > > > > maintained
> > > > > > >> > afterwards, even then, we would need an API in
> > > > > > RemoteLogMetadataManager
> > > > > > >> so
> > > > > > >> > that RLM could fetch the cached size!
> > > > > > >> >
> > > > > > >> > If we wish to cache the size without adding a new API, then
> we
> > > > need
> > > > > to
> > > > > > >> > cache the size in RLM itself (instead of RLMM
> implementation)
> > and
> > > > > > >> > incrementally manage it. The downside of longer archive time
> > at
> > > > > > startup
> > > > > > > >> > (due to initial scan) still remains valid in this
> situation.
> > > > > > >> >
> > > > > > >> > --
> > > > > > >> > Divij Vaidya
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > On Fri, Dec 16, 2022 at 12:43 AM Jun Rao
> > <jun@confluent.io.invalid
> > > > >
> > > > > > >> wrote:
> > > > > > >> >
> > > > > > >> > > Hi, Divij,
> > > > > > >> > >
> > > > > > >> > > Thanks for the explanation.
> > > > > > >> > >
> > > > > > >> > > If there is in-memory cache, could we maintain the log
> size
> > in
> > > > the
> > > > > > >> cache
> > > > > > >> > > with the existing API? For example, a replica could make a
> > > > > > >> > > listRemoteLogSegments(TopicIdPartition topicIdPartition)
> > call on
> > > > > > >> startup
> > > > > > >> > to
> > > > > > >> > > get the remote segment size before the current
> leaderEpoch.
> > The
> > > > > > leader
> > > > > > >> > > could then maintain the size incrementally afterwards. On
> > leader
> > > > > > >> change,
> > > > > > >> > > other replicas can make a
> > listRemoteLogSegments(TopicIdPartition
> > > > > > >> > > topicIdPartition, int leaderEpoch) call to get the size of
> > newly
> > > > > > >> > generated
> > > > > > >> > > segments.
> > > > > > >> > >
> > > > > > >> > > Thanks,
> > > > > > >> > >
> > > > > > >> > > Jun
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > On Wed, Dec 14, 2022 at 3:27 AM Divij Vaidya <
> > > > > > divijvaidya13@gmail.com
> > > > > > >> >
> > > > > > >> > > wrote:
> > > > > > >> > >
> > > > > > >> > > > > Is the new method enough for doing size-based
> retention?
> > > > > > >> > > >
> > > > > > >> > > > Yes. You are right in assuming that this API only
> > provides the
> > > > > > >> Remote
> > > > > > >> > > > storage size (for current epoch chain). We would use
> this
> > API
> > > > > for
> > > > > > >> size
> > > > > > >> > > > based retention along with a value of
> > localOnlyLogSegmentSize
> > > > > > which
> > > > > > >> is
> > > > > > >> > > > computed as
> > Log.sizeInBytes(logSegments.filter(_.baseOffset >
> > > > > > >> > > > highestOffsetWithRemoteIndex)). Hence, (total_log_size =
> > > > > > >> > > > remoteLogSizeBytes + log.localOnlyLogSegmentSize). I
> have
> > > > > updated
> > > > > > >> the
> > > > > > >> > KIP
> > > > > > >> > > > with this information. You can also check an example
> > > > > > implementation
> > > > > > >> at
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> >
> https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/Log.scala#L2077
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > > > > Do you imagine all accesses to remote metadata will be
> > > > across
> > > > > > the
> > > > > > >> > > network
> > > > > > >> > > > or will there be some local in-memory cache?
> > > > > > >> > > >
> > > > > > >> > > > I would expect a disk-less implementation to maintain a
> > finite
> > > > > > >> > in-memory
> > > > > > >> > > > cache for segment metadata to optimize the number of
> > network
> > > > > calls
> > > > > > >> made
> > > > > > >> > > to
> > > > > > >> > > > fetch the data. In future, we can think about bringing
> > this
> > > > > finite
> > > > > > >> size
> > > > > > >> > > > cache into RLM itself but that's probably a conversation
> > for a
> > > > > > >> > different
> > > > > > >> > > > KIP. There are many other things we would like to do to
> > > > optimize
> > > > > > the
> > > > > > >> > > Tiered
> > > > > > >> > > > storage interface such as introducing a circular buffer
> /
> > > > > > streaming
> > > > > > >> > > > interface from RSM (so that we don't have to wait to
> > fetch the
> > > > > > >> entire
> > > > > > >> > > > segment before starting to send records to the
> consumer),
> > > > > caching
> > > > > > >> the
> > > > > > >> > > > segments fetched from RSM locally (I would assume all
> RSM
> > > > plugin
> > > > > > >> > > > implementations to do this, might as well add it to RLM)
> > etc.
> > > > > > >> > > >
> > > > > > >> > > > --
> > > > > > >> > > > Divij Vaidya
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > > > On Mon, Dec 12, 2022 at 7:35 PM Jun Rao
> > > > > <jun@confluent.io.invalid
> > > > > > >
> > > > > > >> > > wrote:
> > > > > > >> > > >
> > > > > > >> > > > > Hi, Divij,
> > > > > > >> > > > >
> > > > > > >> > > > > Thanks for the reply.
> > > > > > >> > > > >
> > > > > > >> > > > > Is the new method enough for doing size-based
> > retention? It
> > > > > > gives
> > > > > > >> the
> > > > > > >> > > > total
> > > > > > >> > > > > size of the remote segments, but it seems that we
> still
> > > > don't
> > > > > > know
> > > > > > >> > the
> > > > > > >> > > > > exact total size for a log since there could be
> > overlapping
> > > > > > >> segments
> > > > > > >> > > > > between the remote and the local segments.
> > > > > > >> > > > >
> > > > > > >> > > > > You mentioned a disk-less implementation. Do you
> > imagine all
> > > > > > >> accesses
> > > > > > >> > > to
> > > > > > >> > > > > remote metadata will be across the network or will
> > there be
> > > > > some
> > > > > > >> > local
> > > > > > >> > > > > in-memory cache?
> > > > > > >> > > > >
> > > > > > >> > > > > Thanks,
> > > > > > >> > > > >
> > > > > > >> > > > > Jun
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > > > On Wed, Dec 7, 2022 at 3:10 AM Divij Vaidya <
> > > > > > >> divijvaidya13@gmail.com
> > > > > > >> > >
> > > > > > >> > > > > wrote:
> > > > > > >> > > > >
> > > > > > >> > > > > > The method is needed for RLMM implementations which
> > fetch
> > > > > the
> > > > > > >> > > > information
> > > > > > >> > > > > > over the network and not for the disk based
> > > > implementations
> > > > > > >> (such
> > > > > > >> > as
> > > > > > >> > > > the
> > > > > > >> > > > > > default topic based RLMM).
> > > > > > >> > > > > >
> > > > > > >> > > > > > I would argue that adding this API makes the
> interface
> > > > more
> > > > > > >> generic
> > > > > > >> > > > than
> > > > > > >> > > > > > what it is today. This is because, with the current
> > APIs
> > > > an
> > > > > > >> > > implementor
> > > > > > >> > > > > is
> > > > > > >> > > > > > restricted to use disk based RLMM solutions only
> > (i.e. the
> > > > > > >> default
> > > > > > >> > > > > > solution) whereas if we add this new API, we unblock
> > usage
> > > > > of
> > > > > > >> > network
> > > > > > >> > > > > based
> > > > > > >> > > > > > RLMM implementations such as databases.
> > > > > > >> > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > > > On Wed 30. Nov 2022 at 20:40, Jun Rao
> > > > > > <jun@confluent.io.invalid
> > > > > > >> >
> > > > > > >> > > > wrote:
> > > > > > >> > > > > >
> > > > > > >> > > > > > > Hi, Divij,
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > Thanks for the reply.
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > Point#2. My high level question is that is the new
> > > > method
> > > > > > >> needed
> > > > > > >> > > for
> > > > > > >> > > > > > every
> > > > > > >> > > > > > > implementation of remote storage or just for a
> > specific
> > > > > > >> > > > implementation.
> > > > > > >> > > > > > The
> > > > > > >> > > > > > > issues that you pointed out exist for the default
> > > > > > >> implementation
> > > > > > >> > of
> > > > > > >> > > > > RLMM
> > > > > > >> > > > > > as
> > > > > > >> > > > > > > well and so far, the default implementation hasn't
> > > > found a
> > > > > > >> need
> > > > > > >> > > for a
> > > > > > >> > > > > > > similar new method. For public interface, ideally
> we
> > > > want
> > > > > to
> > > > > > >> make
> > > > > > >> > > it
> > > > > > >> > > > > more
> > > > > > >> > > > > > > general.
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > Thanks,
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > Jun
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > On Mon, Nov 21, 2022 at 7:11 AM Divij Vaidya <
> > > > > > >> > > > divijvaidya13@gmail.com>
> > > > > > >> > > > > > > wrote:
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > > Thank you Jun and Alex for your comments.
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > Point#1: You are right Jun. As Alex mentioned,
> the
> > > > > > "derived
> > > > > > >> > > > metadata"
> > > > > > >> > > > > > can
> > > > > > >> > > > > > > > increase the size of cached metadata by a factor
> > of 10
> > > > > but
> > > > > > >> it
> > > > > > >> > > > should
> > > > > > >> > > > > be
> > > > > > >> > > > > > > ok
> > > > > > >> > > > > > > > to cache just the actual metadata. My point
> about
> > size
> > > > > > >> being a
> > > > > > >> > > > > > limitation
> > > > > > >> > > > > > > > for using cache is not valid anymore.
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > Point#2: For a new replica, it would still have
> to
> > > > fetch
> > > > > > the
> > > > > > >> > > > metadata
> > > > > > >> > > > > > > over
> > > > > > >> > > > > > > > the network to initiate the warm up of the cache
> > and
> > > > > > hence,
> > > > > > >> > > > increase
> > > > > > >> > > > > > the
> > > > > > >> > > > > > > > start time of the archival process. Please also
> > note
> > > > the
> > > > > > >> > > > > repercussions
> > > > > > >> > > > > > of
> > > > > > >> > > > > > > > the warm up scan that Alex mentioned in this
> > thread as
> > > > > > part
> > > > > > >> of
> > > > > > >> > > > > #102.2.
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > 100#: Agreed Alex. Thanks for clarifying that.
> My
> > > > point
> > > > > > >> about
> > > > > > >> > > size
> > > > > > >> > > > > > being
> > > > > > >> > > > > > > a
> > > > > > >> > > > > > > > limitation for using cache is not valid anymore.
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > 101#: Alex, if I understand correctly, you are
> > > > > suggesting
> > > > > > to
> > > > > > >> > > cache
> > > > > > >> > > > > the
> > > > > > >> > > > > > > > total size at the leader and update it on
> > archival.
> > > > This
> > > > > > >> > wouldn't
> > > > > > >> > > > > work
> > > > > > >> > > > > > > for
> > > > > > >> > > > > > > > cases when the leader restarts where we would
> > have to
> > > > > > make a
> > > > > > >> > full
> > > > > > >> > > > > scan
> > > > > > >> > > > > > > > to update the total size entry on startup. We
> > expect
> > > > > users
> > > > > > >> to
> > > > > > >> > > store
> > > > > > >> > > > > > data
> > > > > > >> > > > > > > > over longer duration in remote storage which
> > increases
> > > > > the
> > > > > > >> > > > likelihood
> > > > > > >> > > > > > of
> > > > > > >> > > > > > > > leader restarts / failovers.
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > 102#.1: I don't think that the current design
> > > > > accommodates
> > > > > > >> the
> > > > > > >> > > fact
> > > > > > >> > > > > > that
> > > > > > >> > > > > > > > data corruption could happen at the RLMM plugin
> > (we
> > > > > don't
> > > > > > >> have
> > > > > > >> > > > > checksum
> > > > > > >> > > > > > > as
> > > > > > >> > > > > > > > a field in metadata as part of KIP405). If data
> > > > > corruption
> > > > > > >> > > occurs,
> > > > > > >> > > > w/
> > > > > > >> > > > > > or
> > > > > > >> > > > > > > > w/o the cache, it would be a different problem
> to
> > > > > solve. I
> > > > > > >> > would
> > > > > > >> > > > like
> > > > > > >> > > > > > to
> > > > > > >> > > > > > > > keep this outside the scope of this KIP.
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > 102#.2: Agree. This remains as the main concern
> > for
> > > > > using
> > > > > > >> the
> > > > > > >> > > cache
> > > > > > >> > > > > to
> > > > > > >> > > > > > > > fetch total size.
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > Regards,
> > > > > > >> > > > > > > > Divij Vaidya
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > On Fri, Nov 18, 2022 at 12:59 PM Alexandre
> > Dupriez <
> > > > > > >> > > > > > > > alexandre.dupriez@gmail.com> wrote:
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > > Hi Divij,
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > Thanks for the KIP. Please find some comments
> > based
> > > > on
> > > > > > >> what I
> > > > > > >> > > > read
> > > > > > >> > > > > on
> > > > > > >> > > > > > > > > this thread so far - apologies for the repeats
> > and
> > > > the
> > > > > > >> late
> > > > > > >> > > > reply.
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > If I understand correctly, one of the main
> > elements
> > > > of
> > > > > > >> > > discussion
> > > > > > >> > > > > is
> > > > > > >> > > > > > > > > about caching in Kafka versus delegation of
> > > > providing
> > > > > > the
> > > > > > >> > > remote
> > > > > > >> > > > > size
> > > > > > >> > > > > > > > > of a topic-partition to the plugin.
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > A few comments:
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > 100. The size of the “derived metadata” which
> is
> > > > > managed
> > > > > > >> by
> > > > > > >> > the
> > > > > > >> > > > > > plugin
> > > > > > >> > > > > > > > > to represent an rlmMetadata can indeed be
> close
> > to 1
> > > > > kB
> > > > > > on
> > > > > > >> > > > average
> > > > > > >> > > > > > > > > depending on its own internal structure, e.g.
> > the
> > > > > > >> redundancy
> > > > > > >> > it
> > > > > > > >> > > > > > > > > enforces (unfortunately resulting in
> > duplication),
> > > > > > >> additional
> > > > > > >> > > > > > > > > information such as checksums and primary and
> > > > > secondary
> > > > > > >> > > indexable
> > > > > > >> > > > > > > > > keys. But indeed, the rlmMetadata is itself a
> > > > lighter
> > > > > > data
> > > > > > >> > > > > structure
> > > > > > >> > > > > > > > > by a factor of 10. And indeed, instead of
> > caching
> > > > the
> > > > > > >> > “derived
> > > > > > >> > > > > > > > > metadata”, only the rlmMetadata could be,
> which
> > > > should
> > > > > > >> > address
> > > > > > >> > > > the
> > > > > > >> > > > > > > > > concern regarding the memory occupancy of the
> > cache.
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > 101. I am not sure I fully understand why we
> > would
> > > > > need
> > > > > > to
> > > > > > >> > > cache
> > > > > > >> > > > > the
> > > > > > >> > > > > > > > > list of rlmMetadata to retain the remote size
> > of a
> > > > > > >> > > > topic-partition.
> > > > > > >> > > > > > > > > Since the leader of a topic-partition is, in
> > > > > > >> non-degenerated
> > > > > > >> > > > cases,
> > > > > > >> > > > > > > > > the only actor which can mutate the remote
> part
> > of
> > > > the
> > > > > > >> > > > > > > > > topic-partition, hence its size, it could in
> > theory
> > > > > only
> > > > > > >> > cache
> > > > > > >> > > > the
> > > > > > >> > > > > > > > > size of the remote log once it has calculated
> > it? In
> > > > > > which
> > > > > > >> > case
> > > > > > >> > > > > there
> > > > > > >> > > > > > > > > would not be any problem regarding the size of
> > the
> > > > > > caching
> > > > > > >> > > > > strategy.
> > > > > > >> > > > > > > > > Did I miss something there?
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > 102. There may be a few challenges to consider
> > with
> > > > > > >> caching:
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > 102.1) As mentioned above, the caching
> strategy
> > > > > assumes
> > > > > > no
> > > > > > >> > > > mutation
> > > > > > >> > > > > > > > > outside the lifetime of a leader. While this
> is
> > true
> > > > > in
> > > > > > >> the
> > > > > > >> > > > normal
> > > > > > >> > > > > > > > > course of operation, there could be accidental
> > > > > mutation
> > > > > > >> > outside
> > > > > > >> > > > of
> > > > > > >> > > > > > the
> > > > > > >> > > > > > > > > leader and a loss of consistency between the
> > cached
> > > > > > state
> > > > > > >> and
> > > > > > >> > > the
> > > > > > >> > > > > > > > > actual remote representation of the log. E.g.
> > > > > > split-brain
> > > > > > >> > > > > scenarios,
> > > > > > >> > > > > > > > > bugs in the plugins, bugs in external systems
> > with
> > > > > > >> mutating
> > > > > > >> > > > access
> > > > > > >> > > > > on
> > > > > > >> > > > > > > > > the derived metadata. In the worst case, a
> drift
> > > > > between
> > > > > > >> the
> > > > > > >> > > > cached
> > > > > > >> > > > > > > > > size and the actual size could lead to
> > over-deleting
> > > > > > >> remote
> > > > > > >> > > data
> > > > > > >> > > > > > which
> > > > > > >> > > > > > > > > is a durability risk.
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > The alternative you propose, by making the
> > plugin
> > > > the
> > > > > > >> source
> > > > > > >> > of
> > > > > > >> > > > > truth
> > > > > > > >> > > > > > > > > w.r.t. the size of the remote log, can make
> > it
> > > > > easier
> > > > > > >> to
> > > > > > >> > > avoid
> > > > > > >> > > > > > > > > inconsistencies between plugin-managed
> metadata
> > and
> > > > > the
> > > > > > >> > remote
> > > > > > >> > > > log
> > > > > > >> > > > > > > > > from the perspective of Kafka. On the other
> > hand,
> > > > > plugin
> > > > > > >> > > vendors
> > > > > > >> > > > > > would
> > > > > > >> > > > > > > > > have to implement it with the expected
> > efficiency to
> > > > > > have
> > > > > > >> it
> > > > > > >> > > > yield
> > > > > > >> > > > > > > > > benefits.
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > 102.2) As you mentioned, the caching strategy
> in
> > > > Kafka
> > > > > > >> would
> > > > > > >> > > > still
> > > > > > >> > > > > > > > > require one iteration over the list of
> > rlmMetadata
> > > > > when
> > > > > > >> the
> > > > > > >> > > > > > leadership
> > > > > > >> > > > > > > > > of a topic-partition is assigned to a broker,
> > while
> > > > > the
> > > > > > >> > plugin
> > > > > > >> > > > can
> > > > > > >> > > > > > > > > offer alternative constant-time approaches.
> This
> > > > > > >> calculation
> > > > > > >> > > > cannot
> > > > > > >> > > > > > be
> > > > > > >> > > > > > > > > put on the LeaderAndIsr path and would be
> > performed
> > > > in
> > > > > > the
> > > > > > >> > > > > > background.
> > > > > > >> > > > > > > > > In case of bulk leadership migration, listing
> > the
> > > > > > >> rlmMetadata
> > > > > > >> > > > could
> > > > > > >> > > > > > a)
> > > > > > >> > > > > > > > > result in request bursts to any backend system
> > the
> > > > > > plugin
> > > > > > >> may
> > > > > > >> > > use
> > > > > > >> > > > > > > > > [which shouldn’t be a problem for
> > high-throughput
> > > > data
> > > > > > >> stores
> > > > > > >> > > but
> > > > > > >> > > > > > > > > could have cost implications] b) increase
> > > > utilisation
> > > > > > >> > timespan
> > > > > > >> > > of
> > > > > > >> > > > > the
> > > > > > >> > > > > > > > > RLM threads for these calculations potentially
> > > > leading
> > > > > > to
> > > > > > >> > > > transient
> > > > > > >> > > > > > > > > starvation of tasks queued for, typically,
> > > > offloading
> > > > > > >> > > operations
> > > > > > >> > > > c)
> > > > > > >> > > > > > > > > could have a non-marginal CPU footprint on
> > hardware
> > > > > with
> > > > > > >> > strict
> > > > > > >> > > > > > > > > resource constraints. All these elements could
> > have
> > > > an
> > > > > > >> impact
> > > > > > >> > > to
> > > > > > >> > > > > some
> > > > > > >> > > > > > > > > degree depending on the operational
> environment.
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > From a design perspective, one question is
> > where we
> > > > > want
> > > > > > >> the
> > > > > > >> > > > source
> > > > > > >> > > > > > of
> > > > > > >> > > > > > > > > truth w.r.t. remote log size to be during the
> > > > lifetime
> > > > > > of
> > > > > > >> a
> > > > > > >> > > > leader.
> > > > > > >> > > > > > > > > The responsibility of maintaining a consistent
> > > > > > >> representation
> > > > > > >> > > of
> > > > > > >> > > > > the
> > > > > > >> > > > > > > > > remote log is shared by Kafka and the plugin.
> > Which
> > > > > > >> system is
> > > > > > >> > > > best
> > > > > > >> > > > > > > > > placed to maintain such a state while
> providing
> > the
> > > > > > >> highest
> > > > > > >> > > > > > > > > consistency guarantees is something both Kafka
> > and
> > > > > > plugin
> > > > > > >> > > > designers
> > > > > > >> > > > > > > > > could help understand better.
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > Many thanks,
> > > > > > >> > > > > > > > > Alexandre
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > On Thu, Nov 17, 2022 at 19:27, Jun Rao
> > > > > > >> > <jun@confluent.io.invalid
> > > > > > >> > > >
> > > > > > >> > > > > > > > wrote:
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > Hi, Divij,
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > Thanks for the reply.
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > Point #1. Is the average remote segment
> > metadata
> > > > > > really
> > > > > > >> > 1KB?
> > > > > > >> > > > > What's
> > > > > > >> > > > > > > > > listed
> > > > > > >> > > > > > > > > > in the public interface is probably well
> > below 100
> > > > > > >> bytes.
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > Point #2. I guess you are assuming that each
> > > > broker
> > > > > > only
> > > > > > >> > > caches
> > > > > > >> > > > > the
> > > > > > >> > > > > > > > > remote
> > > > > > >> > > > > > > > > > segment metadata in memory. An alternative
> > > > approach
> > > > > is
> > > > > > >> to
> > > > > > >> > > cache
> > > > > > >> > > > > > them
> > > > > > >> > > > > > > in
> > > > > > >> > > > > > > > > > both memory and local disk. That way, on
> > broker
> > > > > > restart,
> > > > > > >> > you
> > > > > > >> > > > just
> > > > > > >> > > > > > > need
> > > > > > >> > > > > > > > to
> > > > > > >> > > > > > > > > > fetch the new remote segments' metadata
> using
> > the
> > > > > > >> > > > > > > > > > listRemoteLogSegments(TopicIdPartition
> > > > > > topicIdPartition,
> > > > > > >> > int
> > > > > > >> > > > > > > > leaderEpoch)
> > > > > > >> > > > > > > > > > api. Will that work?
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > Point #3. Thanks for the explanation and it
> > sounds
> > > > > > good.
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > Thanks,
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > Jun
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > On Thu, Nov 17, 2022 at 7:31 AM Divij
> Vaidya <
> > > > > > >> > > > > > > divijvaidya13@gmail.com>
> > > > > > >> > > > > > > > > > wrote:
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > > Hi Jun
> > > > > > >> > > > > > > > > > >
> > > > > > >> > > > > > > > > > > There are three points that I would like
> to
> > > > > present
> > > > > > >> here:
> > > > > > >> > > > > > > > > > >
> > > > > > >> > > > > > > > > > > 1. We would require a large cache size to
> > > > > > efficiently
> > > > > > >> > cache
> > > > > > >> > > > all
> > > > > > >> > > > > > > > segment
> > > > > > >> > > > > > > > > > > metadata.
> > > > > > >> > > > > > > > > > > 2. Linear scan of all metadata at broker
> > startup
> > > > > to
> > > > > > >> > > populate
> > > > > > >> > > > > the
> > > > > > >> > > > > > > > cache
> > > > > > >> > > > > > > > > will
> > > > > > >> > > > > > > > > > > be slow and will impact the archival
> > process.
> > > > > > >> > > > > > > > > > > 3. There is no other use case where a full
> > scan
> > > > of
> > > > > > >> > segment
> > > > > > >> > > > > > metadata
> > > > > > >> > > > > > > > is
> > > > > > >> > > > > > > > > > > required.
> > > > > > >> > > > > > > > > > >
> > > > > > >> > > > > > > > > > > Let's start by quantifying 1. Here's my
> > estimate
> > > > > for
> > > > > > >> the
> > > > > > >> > > size
> > > > > > >> > > > > of
> > > > > > >> > > > > > > the
> > > > > > >> > > > > > > > > cache.
> > > > > > >> > > > > > > > > > > Average size of segment metadata = 1KB.
> This
> > > > could
> > > > > > be
> > > > > > >> > more
> > > > > > >> > > if
> > > > > > >> > > > > we
> > > > > > >> > > > > > > have
> > > > > > >> > > > > > > > > > > frequent leader failover with a large
> > number of
> > > > > > leader
> > > > > > >> > > epochs
> > > > > > >> > > > > > being
> > > > > > >> > > > > > > > > stored
> > > > > > >> > > > > > > > > > > per segment.
> > > > > > >> > > > > > > > > > > Segment size = 100MB. Users will prefer to
> > > > reduce
> > > > > > the
> > > > > > >> > > segment
> > > > > > >> > > > > > size
> > > > > > >> > > > > > > > > from the
> > > > > > >> > > > > > > > > > > default value of 1GB to ensure timely
> > archival
> > > > of
> > > > > > data
> > > > > > >> > > since
> > > > > > >> > > > > data
> > > > > > >> > > > > > > > from
> > > > > > >> > > > > > > > > > > active segment is not archived.
> > > > > > >> > > > > > > > > > > Cache size = num segments * avg. segment
> > > > metadata
> > > > > > >> size =
> > > > > > >> > > > > > > > > (100TB/100MB)*1KB
> > > > > > >> > > > > > > > > > > = 1GB.
> > > > > > >> > > > > > > > > > > While 1GB for cache may not sound like a
> > large
> > > > > > number
> > > > > > >> for
> > > > > > >> > > > > larger
> > > > > > >> > > > > > > > > machines,
> > > > > > >> > > > > > > > > > > it does eat into the memory as an
> additional
> > > > cache
> > > > > > and
> > > > > > >> > > makes
> > > > > > >> > > > > use
> > > > > > >> > > > > > > > cases
> > > > > > >> > > > > > > > > with
> > > > > > >> > > > > > > > > > > large data retention with low throughput
> > > > expensive
> > > > > > >> (where
> > > > > > >> > > > such
> > > > > > >> > > > > > use
> > > > > > >> > > > > > > > case
> > > > > > >> > > > > > > > > > > could otherwise use smaller machines).
> > > > > > >> > > > > > > > > > >
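The two cache-size estimates discussed in this thread (1KB of metadata per 100MB segment versus 100 bytes per 1GB segment, both for 100TB of remote data) can be reproduced with a small sketch. It assumes decimal units; the helper name metadata_cache_bytes is hypothetical and is not part of Kafka.

```python
# Back-of-the-envelope sizing for an in-memory remote segment metadata cache.
# Assumption: decimal units (1 TB = 10**12 bytes), as the thread's round
# numbers imply. metadata_cache_bytes is a hypothetical helper, not a Kafka API.
KB, MB, GB, TB = 10**3, 10**6, 10**9, 10**12


def metadata_cache_bytes(remote_bytes, segment_bytes, metadata_bytes_per_segment):
    """Return (number of segments, total metadata bytes) for one broker."""
    num_segments = remote_bytes // segment_bytes
    return num_segments, num_segments * metadata_bytes_per_segment


# Estimate with 100MB segments and 1KB of metadata per segment.
small_segments = metadata_cache_bytes(100 * TB, 100 * MB, 1 * KB)

# Estimate with 1GB segments and 100 bytes of metadata per segment.
large_segments = metadata_cache_bytes(100 * TB, 1 * GB, 100)

print(small_segments)  # (segment count, cache bytes)
print(large_segments)
```

The gap between the two results comes entirely from the assumed segment size and the assumed per-segment metadata size, which is why both inputs are questioned in the replies.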
> > > > > > >> > > > > > > > > > > About point#2:
> > > > > > >> > > > > > > > > > > Even if we say that all segment metadata
> > can fit
> > > > > > into
> > > > > > >> the
> > > > > > >> > > > > cache,
> > > > > > >> > > > > > we
> > > > > > >> > > > > > > > > will
> > > > > > >> > > > > > > > > > > need to populate the cache on broker
> > startup. It
> > > > > > would
> > > > > > >> > not
> > > > > > >> > > be
> > > > > > >> > > > > in
> > > > > > >> > > > > > > the
> > > > > > >> > > > > > > > > > > critical path of broker startup and hence
> > won't
> > > > > > >> impact
> > > > > > >> > the
> > > > > > >> > > > > > startup
> > > > > > >> > > > > > > > > time.
> > > > > > >> > > > > > > > > > > But it will impact the time when we could
> > start
> > > > > the
> > > > > > >> > > archival
> > > > > > >> > > > > > > process
> > > > > > >> > > > > > > > > since
> > > > > > >> > > > > > > > > > > the RLM thread pool will be blocked on the
> > first
> > > > > > call
> > > > > > >> to
> > > > > > >> > > > > > > > > > > listRemoteLogSegments(). To scan metadata
> > for
> > > > 1MM
> > > > > > >> > segments
> > > > > > >> > > > > > > (computed
> > > > > > >> > > > > > > > > above)
> > > > > > >> > > > > > > > > > > and transfer 1GB data over the network
> from
> > a
> > > > RLMM
> > > > > > >> such
> > > > > > >> > as
> > > > > > >> > > a
> > > > > > >> > > > > > remote
> > > > > > >> > > > > > > > > > > database would be in the order of minutes
> > > > > (depending
> > > > > > >> on
> > > > > > >> > how
> > > > > > >> > > > > > > efficient
> > > > > > >> > > > > > > > > the
> > > > > > >> > > > > > > > > > > scan is with the RLMM implementation).
> > > > Although, I
> > > > > > >> would
> > > > > > >> > > > > concede
> > > > > > >> > > > > > > that
> > > > > > >> > > > > > > > > > > having RLM threads blocked for a few
> > minutes is
> > > > > > >> perhaps
> > > > > > >> > OK
> > > > > > >> > > > but
> > > > > > >> > > > > if
> > > > > > >> > > > > > > we
> > > > > > >> > > > > > > > > > > introduce the new API proposed in the KIP,
> > we
> > > > > would
> > > > > > >> have
> > > > > > >> > a
> > > > > > >> > > > > > > > > > > deterministic startup time for RLM. Adding
> > the
> > > > API
> > > > > > >> comes
> > > > > > >> > > at a
> > > > > > >> > > > > low
> > > > > > >> > > > > > > > cost
> > > > > > >> > > > > > > > > and
> > > > > > >> > > > > > > > > > > I believe the trade off is worth it.
> > > > > > >> > > > > > > > > > >
> > > > > > >> > > > > > > > > > > About point#3:
> > > > > > >> > > > > > > > > > > We can use
> > > > listRemoteLogSegments(TopicIdPartition
> > > > > > >> > > > > > topicIdPartition,
> > > > > > >> > > > > > > > int
> > > > > > >> > > > > > > > > > > leaderEpoch) to calculate the segments
> > eligible
> > > > > for
> > > > > > >> > > deletion
> > > > > > >> > > > > > (based
> > > > > > >> > > > > > > > on
> > > > > > >> > > > > > > > > size
> > > > > > >> > > > > > > > > > > retention) where leader epoch(s) belong to
> > the
> > > > > > current
> > > > > > >> > > leader
> > > > > > >> > > > > > epoch
> > > > > > >> > > > > > > > > chain.
> > > > > > >> > > > > > > > > > > I understand that it may lead to segments
> > > > > belonging
> > > > > > to
> > > > > > >> > > other
> > > > > > >> > > > > > epoch
> > > > > > >> > > > > > > > > lineage
> > > > > > >> > > > > > > > > > > not getting deleted and would require a
> > separate
> > > > > > >> > mechanism
> > > > > > >> > > to
> > > > > > >> > > > > > > delete
> > > > > > >> > > > > > > > > them.
> > > > > > >> > > > > > > > > > > The separate mechanism would anyway be
> > required
> > > > > to
> > > > > > >> > delete
> > > > > > >> > > > > these
> > > > > > >> > > > > > > > > "leaked"
> > > > > > >> > > > > > > > > > > segments as there are other cases which
> > could
> > > > lead
> > > > > > to
> > > > > > >> > leaks
> > > > > > >> > > > > such
> > > > > > >> > > > > > as
> > > > > > >> > > > > > > > > network
> > > > > > >> > > > > > > > > > > problems with RSM midway through writing a
> > > > segment,
> > > > > > >> etc.
> > > > > > >> > > > > > > > > > >
> > > > > > >> > > > > > > > > > > Thank you for the replies so far. They
> have
> > made
> > > > > me
> > > > > > >> > > re-think
> > > > > > >> > > > my
> > > > > > >> > > > > > > > > assumptions
> > > > > > >> > > > > > > > > > > and this dialogue has been very
> > constructive for
> > > > > me.
> > > > > > >> > > > > > > > > > >
> > > > > > >> > > > > > > > > > > Regards,
> > > > > > >> > > > > > > > > > > Divij Vaidya
> > > > > > >> > > > > > > > > > >
> > > > > > >> > > > > > > > > > >
> > > > > > >> > > > > > > > > > >
> > > > > > >> > > > > > > > > > > On Thu, Nov 10, 2022 at 10:49 PM Jun Rao
> > > > > > >> > > > > > <jun@confluent.io.invalid
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > > wrote:
> > > > > > >> > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > Hi, Divij,
> > > > > > >> > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > Thanks for the reply.
> > > > > > >> > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > It's true that the data in Kafka could
> be
> > kept
> > > > > > >> longer
> > > > > > >> > > with
> > > > > > >> > > > > > > KIP-405.
> > > > > > >> > > > > > > > > How
> > > > > > >> > > > > > > > > > > > much data do you envision to have per
> > broker?
> > > > > For
> > > > > > >> 100TB
> > > > > > >> > > > data
> > > > > > >> > > > > > per
> > > > > > >> > > > > > > > > broker,
> > > > > > >> > > > > > > > > > > > with 1GB segment and segment metadata of
> > 100
> > > > > > bytes,
> > > > > > >> it
> > > > > > >> > > > > requires
> > > > > > >> > > > > > > > > > > > 100TB/1GB*100 = 10MB, which should fit
> in
> > > > > memory.
> > > > > > >> > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > RemoteLogMetadataManager has two
> > > > > > >> > listRemoteLogSegments()
> > > > > > >> > > > > > methods.
> > > > > > >> > > > > > > > > The one
> > > > > > >> > > > > > > > > > > > you listed
> > > > > listRemoteLogSegments(TopicIdPartition
> > > > > > >> > > > > > > topicIdPartition,
> > > > > > >> > > > > > > > > int
> > > > > > >> > > > > > > > > > > > leaderEpoch) does return data in offset
> > order.
> > > > > > >> However,
> > > > > > >> > > the
> > > > > > >> > > > > > other
> > > > > > >> > > > > > > > > > > > one
> listRemoteLogSegments(TopicIdPartition
> > > > > > >> > > > topicIdPartition)
> > > > > > >> > > > > > > > doesn't
> > > > > > >> > > > > > > > > > > > specify the return order. I assume that
> > you
> > > > need
> > > > > > the
> > > > > > >> > > latter
> > > > > > >> > > > > to
> > > > > > >> > > > > > > > > calculate
> > > > > > >> > > > > > > > > > > > the segment size?
> > > > > > >> > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > Thanks,
> > > > > > >> > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > Jun
> > > > > > >> > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > On Thu, Nov 10, 2022 at 10:25 AM Divij
> > Vaidya
> > > > <
> > > > > > >> > > > > > > > > divijvaidya13@gmail.com>
> > > > > > >> > > > > > > > > > > > wrote:
> > > > > > >> > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > *Jun,*
> > > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > *"the default implementation of RLMM
> > does
> > > > > local
> > > > > > >> > > caching,
> > > > > > >> > > > > > > right?"*
> > > > > > >> > > > > > > > > > > > > Yes, Jun. The default implementation
> of
> > RLMM
> > > > > > does
> > > > > > >> > > indeed
> > > > > > >> > > > > > cache
> > > > > > >> > > > > > > > the
> > > > > > >> > > > > > > > > > > > segment
> > > > > > >> > > > > > > > > > > > > metadata today, hence, it won't work
> > for use
> > > > > > cases
> > > > > > >> > when
> > > > > > >> > > > the
> > > > > > >> > > > > > > > number
> > > > > > >> > > > > > > > > of
> > > > > > >> > > > > > > > > > > > > segments in remote storage is large
> > enough
> > > > to
> > > > > > >> exceed
> > > > > > >> > > the
> > > > > > >> > > > > size
> > > > > > >> > > > > > > of
> > > > > > >> > > > > > > > > cache.
> > > > > > >> > > > > > > > > > > > As
> > > > > > >> > > > > > > > > > > > > part of this KIP, I will implement the
> > new
> > > > > > >> proposed
> > > > > > >> > API
> > > > > > >> > > > in
> > > > > > >> > > > > > the
> > > > > > >> > > > > > > > > default
> > > > > > >> > > > > > > > > > > > > implementation of RLMM but the
> > underlying
> > > > > > >> > > implementation
> > > > > > >> > > > > will
> > > > > > >> > > > > > > > > still be
> > > > > > >> > > > > > > > > > > a
> > > > > > >> > > > > > > > > > > > > scan. I will pick up optimizing that
> in
> > a
> > > > > > separate
> > > > > > >> > PR.
> > > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > *"we also cache all segment metadata
> in
> > the
> > > > > > >> brokers
> > > > > > >> > > > without
> > > > > > >> > > > > > > > > KIP-405. Do
> > > > > > >> > > > > > > > > > > > you
> > > > > > >> > > > > > > > > > > > > see a need to change that?"*
> > > > > > >> > > > > > > > > > > > > Please correct me if I am wrong here
> > but we
> > > > > > cache
> > > > > > >> > > > metadata
> > > > > > >> > > > > > for
> > > > > > >> > > > > > > > > segments
> > > > > > >> > > > > > > > > > > > > "residing in local storage". The size
> > of the
> > > > > > >> current
> > > > > > >> > > > cache
> > > > > > >> > > > > > > works
> > > > > > >> > > > > > > > > fine
> > > > > > >> > > > > > > > > > > for
> > > > > > >> > > > > > > > > > > > > the scale of the number of segments
> > that we
> > > > > > >> expect to
> > > > > > >> > > > store
> > > > > > >> > > > > > in
> > > > > > >> > > > > > > > > local
> > > > > > >> > > > > > > > > > > > > storage. After KIP-405, that cache
> will
> > > > > continue
> > > > > > >> to
> > > > > > >> > > store
> > > > > > >> > > > > > > > metadata
> > > > > > >> > > > > > > > > for
> > > > > > >> > > > > > > > > > > > > segments which are residing in local
> > storage
> > > > > and
> > > > > > >> > hence,
> > > > > > >> > > > we
> > > > > > >> > > > > > > don't
> > > > > > >> > > > > > > > > need
> > > > > > >> > > > > > > > > > > to
> > > > > > >> > > > > > > > > > > > > change that. For segments which have
> > been
> > > > > > >> offloaded
> > > > > > >> > to
> > > > > > >> > > > > remote
> > > > > > >> > > > > > > > > storage,
> > > > > > >> > > > > > > > > > > it
> > > > > > >> > > > > > > > > > > > > would rely on RLMM. Note that the
> scale
> > of
> > > > > data
> > > > > > >> > stored
> > > > > > >> > > in
> > > > > > >> > > > > > RLMM
> > > > > > >> > > > > > > is
> > > > > > >> > > > > > > > > > > > different
> > > > > > >> > > > > > > > > > > > > from local cache because the number of
> > > > > segments
> > > > > > is
> > > > > > >> > > > expected
> > > > > > >> > > > > > to
> > > > > > >> > > > > > > be
> > > > > > >> > > > > > > > > much
> > > > > > >> > > > > > > > > > > > > larger than what current
> implementation
> > > > stores
> > > > > > in
> > > > > > >> > local
> > > > > > >> > > > > > > storage.
> > > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > 2,3,4:
> > > > > > >> > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > >> > > > > does
> > > > > > >> > > > > > > > > specify
> > > > > > >> > > > > > > > > > > the
> > > > > > >> > > > > > > > > > > > > order i.e. it returns the segments
> > sorted by
> > > > > > first
> > > > > > >> > > offset
> > > > > > >> > > > > in
> > > > > > >> > > > > > > > > ascending
> > > > > > >> > > > > > > > > > > > > order. I am copying the API docs for
> > KIP-405
> > > > > > here
> > > > > > >> for
> > > > > > >> > > > your
> > > > > > >> > > > > > > > > reference
> > > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > *Returns iterator of remote log
> segment
> > > > > > metadata,
> > > > > > >> > > sorted
> > > > > > >> > > > by
> > > > > > >> > > > > > > > {@link
> > > > > > >> > > > > > > > > > > > >
> RemoteLogSegmentMetadata#startOffset()}
> > > > > > >> in ascending
> > > > > > >> > > order
> > > > > > >> > > > > > which
> > > > > > >> > > > > > > > > > > contains
> > > > > > >> > > > > > > > > > > > > the given leader epoch. This is used
> by
> > > > remote
> > > > > > log
> > > > > > >> > > > > retention
> > > > > > >> > > > > > > > > management
> > > > > > >> > > > > > > > > > > > > subsystem to fetch the segment metadata
> > for a
> > > > > > given
> > > > > > >> > > leader
> > > > > > >> > > > > > > > > epoch. @param
> > > > > > >> > > > > > > > > > > > > topicIdPartition topic partition @param
> > > > > > >> leaderEpoch
> > > > > > >> > > > > > leader
> > > > > > >> > > > > > > > > > > > > epoch @return
> > > > > > >> > > > > > > > > > > > > Iterator of remote segments, sorted by
> > start
> > > > > > >> offset
> > > > > > >> > in
> > > > > > >> > > > > > > ascending
> > > > > > >> > > > > > > > > > > order. *
> > > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > *Luke,*
> > > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > 5. Note that we are trying to optimize
> > the
> > > > > > >> efficiency
> > > > > > >> > > of
> > > > > > >> > > > > size
> > > > > > >> > > > > > > > based
> > > > > > >> > > > > > > > > > > > > retention for remote storage. KIP-405
> > does
> > > > not
> > > > > > >> > > introduce
> > > > > > >> > > > a
> > > > > > >> > > > > > new
> > > > > > >> > > > > > > > > config
> > > > > > >> > > > > > > > > > > for
> > > > > > >> > > > > > > > > > > > > periodically checking remote storage, similar
> to
> > > > > > >> > > > > > > > > > > log.retention.check.interval.ms
> > > > > > >> > > > > > > > > > > > > which is applicable for remote
> storage.
> > > > Hence,
> > > > > > the
> > > > > > >> > > metric
> > > > > > >> > > > > > will
> > > > > > >> > > > > > > be
> > > > > > >> > > > > > > > > > > updated
> > > > > > >> > > > > > > > > > > > > at the time of invoking log retention
> > check
> > > > > for
> > > > > > >> > remote
> > > > > > >> > > > tier
> > > > > > >> > > > > > > which
> > > > > > >> > > > > > > > > is
> > > > > > >> > > > > > > > > > > > > pending implementation today. We can
> > perhaps
> > > > > > come
> > > > > > >> > back
> > > > > > >> > > > and
> > > > > > >> > > > > > > update
> > > > > > >> > > > > > > > > the
> > > > > > >> > > > > > > > > > > > > metric description after the
> > implementation
> > > > of
> > > > > > log
> > > > > > >> > > > > retention
> > > > > > >> > > > > > > > check
> > > > > > >> > > > > > > > > in
> > > > > > >> > > > > > > > > > > > > RemoteLogManager.
> > > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > --
> > > > > > >> > > > > > > > > > > > > Divij Vaidya
> > > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > On Thu, Nov 10, 2022 at 6:16 AM Luke
> > Chen <
> > > > > > >> > > > > showuon@gmail.com
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > > > wrote:
> > > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > Hi Divij,
> > > > > > >> > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > One more question about the metric:
> > > > > > >> > > > > > > > > > > > > > I think the metric will be updated
> > when
> > > > > > >> > > > > > > > > > > > > > (1) each time we run the log
> retention
> > > > check
> > > > > > >> (that
> > > > > > >> > > is,
> > > > > > >> > > > > > > > > > > > > > log.retention.check.interval.ms)
> > > > > > >> > > > > > > > > > > > > > (2) When a user explicitly calls
> > > > > getRemoteLogSize
> > > > > > >> > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > Is that correct?
> > > > > > >> > > > > > > > > > > > > > Maybe we should add a note in the metric
> > > > > > >> description;
> > > > > > >> > > > > > otherwise,
> > > > > > >> > > > > > > > when
> > > > > > >> > > > > > > > > > > a user
> > > > > > >> > > > > > > > > > > > > sees,
> > > > > > >> > > > > > > > > > > > > > let's say, 0 for RemoteLogSizeBytes,
> > they will be
> > > > > > >> > surprised.
> > > > > > >> > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > Otherwise, LGTM
> > > > > > >> > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > Thank you for the KIP
> > > > > > >> > > > > > > > > > > > > > Luke
> > > > > > >> > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > On Thu, Nov 10, 2022 at 2:55 AM Jun
> > Rao
> > > > > > >> > > > > > > > <jun@confluent.io.invalid
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > wrote:
> > > > > > >> > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > Hi, Divij,
> > > > > > >> > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > Thanks for the explanation.
> > > > > > >> > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > 1. Hmm, the default implementation
> > of
> > > > RLMM
> > > > > > >> does
> > > > > > >> > > local
> > > > > > >> > > > > > > > caching,
> > > > > > >> > > > > > > > > > > right?
> > > > > > >> > > > > > > > > > > > > > > Currently, we also cache all
> segment
> > > > > > metadata
> > > > > > >> in
> > > > > > >> > > the
> > > > > > >> > > > > > > brokers
> > > > > > >> > > > > > > > > > > without
> > > > > > >> > > > > > > > > > > > > > > KIP-405. Do you see a need to
> change
> > > > that?
> > > > > > >> > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > 2,3,4: Yes, your explanation makes
> > > > sense.
> > > > > > >> > However,
> > > > > > >> > > > > > > > > > > > > > > currently,
> > > > > > >> > > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > >> > > > > > > > > doesn't
> > > > > > >> > > > > > > > > > > > > > specify
> > > > > > >> > > > > > > > > > > > > > > a particular order of the
> iterator.
> > Do
> > > > you
> > > > > > >> intend
> > > > > > >> > > to
> > > > > > >> > > > > > change
> > > > > > >> > > > > > > > > that?
> > > > > > >> > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > Thanks,
> > > > > > >> > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > Jun
> > > > > > >> > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 3:31 AM
> Divij
> > > > > Vaidya
> > > > > > <
> > > > > > >> > > > > > > > > > > divijvaidya13@gmail.com
> > > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > wrote:
> > > > > > >> > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > Hey Jun
> > > > > > >> > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > Thank you for your comments.
> > > > > > >> > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > *1. "RLMM implementor could
> ensure
> > > > that
> > > > > > >> > > > > > > > > listRemoteLogSegments()
> > > > > > >> > > > > > > > > > > is
> > > > > > >> > > > > > > > > > > > > > fast"*
> > > > > > >> > > > > > > > > > > > > > > > This would be ideal but
> > pragmatically,
> > > > > it
> > > > > > is
> > > > > > >> > > > > difficult
> > > > > > >> > > > > > to
> > > > > > >> > > > > > > > > ensure
> > > > > > >> > > > > > > > > > > > that
> > > > > > >> > > > > > > > > > > > > > > > listRemoteLogSegments() is fast.
> > This
> > > > is
> > > > > > >> > because
> > > > > > >> > > > > the
> > > > > > >> > > > > > > > > > > possibility
> > > > > > >> > > > > > > > > > > > > of
> > > > > > >> > > > > > > > > > > > > > a
> > > > > > >> > > > > > > > > > > > > > > > large number of segments (much
> > larger
> > > > > than
> > > > > > >> what
> > > > > > >> > > > Kafka
> > > > > > >> > > > > > > > > currently
> > > > > > >> > > > > > > > > > > > > handles
> > > > > > >> > > > > > > > > > > > > > > > with local storage today) would
> > make
> > > > it
> > > > > > >> > > infeasible
> > > > > > >> > > > to
> > > > > > >> > > > > > > adopt
> > > > > > >> > > > > > > > > > > > > strategies
> > > > > > >> > > > > > > > > > > > > > > such
> > > > > > >> > > > > > > > > > > > > > > > as local caching to improve the
> > > > > > performance
> > > > > > >> of
> > > > > > >> > > > > > > > > > > > listRemoteLogSegments.
> > > > > > >> > > > > > > > > > > > > > > Apart
> > > > > > >> > > > > > > > > > > > > > > > from caching (which won't work
> > due to
> > > > > size
> > > > > > >> > > > > > limitations) I
> > > > > > >> > > > > > > > > can't
> > > > > > >> > > > > > > > > > > > think
> > > > > > >> > > > > > > > > > > > > > of
> > > > > > >> > > > > > > > > > > > > > > > other strategies which may
> > eliminate
> > > > the
> > > > > > >> need
> > > > > > >> > for
> > > > > > >> > > > IO
> > > > > > >> > > > > > > > > > > > > > > > operations proportional to the
> > number
> > > > of
> > > > > > >> total
> > > > > > >> > > > > > segments.
> > > > > > >> > > > > > > > > Please
> > > > > > >> > > > > > > > > > > > > advise
> > > > > > >> > > > > > > > > > > > > > if
> > > > > > >> > > > > > > > > > > > > > > > you have something in mind.
> > > > > > >> > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > 2.  "*If the size exceeds the
> > > > retention
> > > > > > >> size,
> > > > > > >> > we
> > > > > > >> > > > need
> > > > > > >> > > > > > to
> > > > > > >> > > > > > > > > > > determine
> > > > > > >> > > > > > > > > > > > > the
> > > > > > >> > > > > > > > > > > > > > > > subset of segments to delete to
> > bring
> > > > > the
> > > > > > >> size
> > > > > > >> > > > within
> > > > > > >> > > > > > the
> > > > > > >> > > > > > > > > > > retention
> > > > > > >> > > > > > > > > > > > > > > limit.
> > > > > > >> > > > > > > > > > > > > > > > Do we need to call
> > > > > > >> > > > > > > > > > >
> > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > >> > > > > > > > > > > > > to
> > > > > > >> > > > > > > > > > > > > > > > determine that?"*
> > > > > > >> > > > > > > > > > > > > > > > Yes, we need to call
> > > > > > >> listRemoteLogSegments() to
> > > > > > >> > > > > > determine
> > > > > > >> > > > > > > > > which
> > > > > > >> > > > > > > > > > > > > > segments
> > > > > > >> > > > > > > > > > > > > > > > should be deleted. But there is
> a
> > > > > > difference
> > > > > > >> > with
> > > > > > >> > > > the
> > > > > > >> > > > > > use
> > > > > > >> > > > > > > > > case we
> > > > > > >> > > > > > > > > > > > are
> > > > > > >> > > > > > > > > > > > > > > > trying to optimize with this
> KIP.
> > To
> > > > > > >> determine
> > > > > > >> > > the
> > > > > > >> > > > > > subset
> > > > > > >> > > > > > > > of
> > > > > > >> > > > > > > > > > > > segments
> > > > > > >> > > > > > > > > > > > > > > which
> > > > > > >> > > > > > > > > > > > > > > > would be deleted, we only read
> > > > metadata
> > > > > > for
> > > > > > >> > > > segments
> > > > > > >> > > > > > > which
> > > > > > >> > > > > > > > > would
> > > > > > >> > > > > > > > > > > be
> > > > > > >> > > > > > > > > > > > > > > deleted
> > > > > > >> > > > > > > > > > > > > > > > via the listRemoteLogSegments().
> > But
> > > > to
> > > > > > >> > determine
> > > > > > >> > > > the
> > > > > > >> > > > > > > > > > > totalLogSize,
> > > > > > >> > > > > > > > > > > > > > which
> > > > > > >> > > > > > > > > > > > > > > > is required every time retention
> > logic
> > > > > > >> based on
> > > > > > >> > > > size
> > > > > > >> > > > > > > > > executes, we
> > > > > > >> > > > > > > > > > > > > read
> > > > > > >> > > > > > > > > > > > > > > > metadata of *all* the segments
> in
> > > > remote
> > > > > > >> > storage.
> > > > > > >> > > > > > Hence,
> > > > > > >> > > > > > > > the
> > > > > > >> > > > > > > > > > > number
> > > > > > >> > > > > > > > > > > > > of
> > > > > > >> > > > > > > > > > > > > > > > results returned by
> > > > > > >> > > > > > > > > > > >
> > > > > *RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > >> > > > > > > > > > > > > > *is
> > > > > > >> > > > > > > > > > > > > > > > different when we are
> calculating
> > > > > > >> totalLogSize
> > > > > > >> > > vs.
> > > > > > >> > > > > when
> > > > > > >> > > > > > > we
> > > > > > >> > > > > > > > > are
> > > > > > >> > > > > > > > > > > > > > > determining
> > > > > > >> > > > > > > > > > > > > > > > the subset of segments to
> delete.
> > > > > > >> > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > 3.
> > > > > > >> > > > > > > > > > > > > > > > *"Also, what about time-based
> > > > retention?
> > > > > > To
> > > > > > >> > make
> > > > > > >> > > > that
> > > > > > >> > > > > > > > > efficient,
> > > > > > >> > > > > > > > > > > do
> > > > > > >> > > > > > > > > > > > > we
> > > > > > >> > > > > > > > > > > > > > > need
> > > > > > >> > > > > > > > > > > > > > > > to make some additional
> interface
> > > > > > >> changes?"* No.
> > > > > > >> > > > Note
> > > > > > >> > > > > > that
> > > > > > >> > > > > > > > > time
> > > > > > >> > > > > > > > > > > > > > complexity
> > > > > > >> > > > > > > > > > > > > > > > to determine the segments for
> > > > retention
> > > > > is
> > > > > > >> > > > different
> > > > > > >> > > > > > for
> > > > > > >> > > > > > > > time
> > > > > > >> > > > > > > > > > > based
> > > > > > >> > > > > > > > > > > > > vs.
> > > > > > >> > > > > > > > > > > > > > > > size based. For time based, the
> > time
> > > > > > >> complexity
> > > > > > >> > > is
> > > > > > >> > > > a
> > > > > > >> > > > > > > > > function of
> > > > > > >> > > > > > > > > > > > the
> > > > > > >> > > > > > > > > > > > > > > number
> > > > > > >> > > > > > > > > > > > > > > > of segments which are "eligible
> > for
> > > > > > >> deletion"
> > > > > > >> > > > (since
> > > > > > >> > > > > we
> > > > > > >> > > > > > > > only
> > > > > > >> > > > > > > > > read
> > > > > > >> > > > > > > > > > > > > > > metadata
> > > > > > >> > > > > > > > > > > > > > > > for segments which would be
> > deleted)
> > > > > > >> whereas in
> > > > > > >> > > > size
> > > > > > >> > > > > > > based
> > > > > > >> > > > > > > > > > > > retention,
> > > > > > >> > > > > > > > > > > > > > the
> > > > > > >> > > > > > > > > > > > > > > > time complexity is a function of
> > "all
> > > > > > >> segments"
> > > > > > >> > > > > > available
> > > > > > >> > > > > > > > in
> > > > > > >> > > > > > > > > > > remote
> > > > > > >> > > > > > > > > > > > > > > storage
> > > > > > >> > > > > > > > > > > > > > > > (metadata of all segments needs
> > to be
> > > > > read
> > > > > > >> to
> > > > > > >> > > > > calculate
> > > > > > >> > > > > > > the
> > > > > > >> > > > > > > > > total
> > > > > > >> > > > > > > > > > > > > > size).
> > > > > > >> > > > > > > > > > > > > > > As
> > > > > > >> > > > > > > > > > > > > > > > you may observe, this KIP will
> > bring
> > > > the
> > > > > > >> time
> > > > > > >> > > > > > complexity
> > > > > > >> > > > > > > > for
> > > > > > >> > > > > > > > > both
> > > > > > >> > > > > > > > > > > > > time
> > > > > > >> > > > > > > > > > > > > > > > based retention & size based
> > retention
> > > > > to
> > > > > > >> the
> > > > > > >> > > same
> > > > > > >> > > > > > > > function.
> > > > > > >> > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > 4. Also, please note that this
> > new API
> > > > > > >> > introduced
> > > > > > >> > > > in
> > > > > > >> > > > > > this
> > > > > > >> > > > > > > > KIP
> > > > > > >> > > > > > > > > > > also
> > > > > > >> > > > > > > > > > > > > > > enables
> > > > > > >> > > > > > > > > > > > > > > > us to provide a metric for total
> > size
> > > > of
> > > > > > >> data
> > > > > > >> > > > stored
> > > > > > >> > > > > in
> > > > > > >> > > > > > > > > remote
> > > > > > >> > > > > > > > > > > > > storage.
> > > > > > >> > > > > > > > > > > > > > > > Without the API, calculation of
> > this
> > > > > > metric
> > > > > > >> > will
> > > > > > >> > > > > become
> > > > > > >> > > > > > > > very
> > > > > > >> > > > > > > > > > > > > expensive
> > > > > > >> > > > > > > > > > > > > > > with
> > > > > > >> > > > > > > > > > > > > > > > *listRemoteLogSegments().*
> > > > > > >> > > > > > > > > > > > > > > > I understand that your
> motivation
> > here
> > > > > is
> > > > > > to
> > > > > > >> > > avoid
> > > > > > >> > > > > > > > polluting
> > > > > > >> > > > > > > > > the
> > > > > > >> > > > > > > > > > > > > > > interface
> > > > > > >> > > > > > > > > > > > > > > > with optimization specific APIs
> > and I
> > > > > will
> > > > > > >> > agree
> > > > > > >> > > > with
> > > > > > >> > > > > > > that
> > > > > > >> > > > > > > > > goal.
> > > > > > >> > > > > > > > > > > > But
> > > > > > >> > > > > > > > > > > > > I
> > > > > > >> > > > > > > > > > > > > > > > believe that this new API
> > proposed in
> > > > > the
> > > > > > >> KIP
> > > > > > >> > > > brings
> > > > > > >> > > > > in
> > > > > > >> > > > > > > > > > > significant
> > > > > > >> > > > > > > > > > > > > > > > improvement and there is no
> other
> > work
> > > > > > >> around
> > > > > > >> > > > > available
> > > > > > >> > > > > > > to
> > > > > > >> > > > > > > > > > > achieve
> > > > > > >> > > > > > > > > > > > > the
> > > > > > >> > > > > > > > > > > > > > > same
> > > > > > >> > > > > > > > > > > > > > > > performance.
> > > > > > >> > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > Regards,
> > > > > > >> > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > >> > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 12:12 AM
> > Jun
> > > > Rao
> > > > > > >> > > > > > > > > <jun@confluent.io.invalid
> > > > > > >> > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > wrote:
> > > > > > >> > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > Hi, Divij,
> > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > Thanks for the KIP. Sorry for
> > the
> > > > late
> > > > > > >> reply.
> > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > The motivation of the KIP is
> to
> > > > > improve
> > > > > > >> the
> > > > > > >> > > > > > efficiency
> > > > > > >> > > > > > > of
> > > > > > >> > > > > > > > > size
> > > > > > >> > > > > > > > > > > > > based
> > > > > > >> > > > > > > > > > > > > > > > > retention. I am not sure the
> > > > proposed
> > > > > > >> changes
> > > > > > >> > > are
> > > > > > >> > > > > > > enough.
> > > > > > >> > > > > > > > > For
> > > > > > >> > > > > > > > > > > > > > example,
> > > > > > >> > > > > > > > > > > > > > > if
> > > > > > >> > > > > > > > > > > > > > > > > the size exceeds the retention
> > size,
> > > > > we
> > > > > > >> need
> > > > > > >> > to
> > > > > > >> > > > > > > determine
> > > > > > >> > > > > > > > > the
> > > > > > >> > > > > > > > > > > > > subset
> > > > > > >> > > > > > > > > > > > > > of
> > > > > > >> > > > > > > > > > > > > > > > > segments to delete to bring
> the
> > size
> > > > > > >> within
> > > > > > >> > the
> > > > > > >> > > > > > > retention
> > > > > > >> > > > > > > > > > > limit.
> > > > > > >> > > > > > > > > > > > Do
> > > > > > >> > > > > > > > > > > > > > we
> > > > > > >> > > > > > > > > > > > > > > > need
> > > > > > >> > > > > > > > > > > > > > > > > to call
> > > > > > >> > > > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > >> > > > > > > > to
> > > > > > >> > > > > > > > > > > > > determine
> > > > > > >> > > > > > > > > > > > > > > > that?
> > > > > > >> > > > > > > > > > > > > > > > > Also, what about time-based
> > > > retention?
> > > > > > To
> > > > > > >> > make
> > > > > > >> > > > that
> > > > > > >> > > > > > > > > efficient,
> > > > > > >> > > > > > > > > > > do
> > > > > > >> > > > > > > > > > > > > we
> > > > > > >> > > > > > > > > > > > > > > need
> > > > > > >> > > > > > > > > > > > > > > > > to make some additional
> > interface
> > > > > > changes?
> > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > An alternative approach is for
> > the
> > > > > RLMM
> > > > > > >> > > > implementor
> > > > > > >> > > > > > to
> > > > > > >> > > > > > > > make
> > > > > > >> > > > > > > > > > > sure
> > > > > > >> > > > > > > > > > > > > > > > > that
> > > > > > >> > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > >> > > > > > > is
> > > > > > >> > > > > > > > > fast
> > > > > > >> > > > > > > > > > > > > (e.g.,
> > > > > > >> > > > > > > > > > > > > > > with
> > > > > > >> > > > > > > > > > > > > > > > > local caching). This way, we
> > could
> > > > > keep
> > > > > > >> the
> > > > > > >> > > > > interface
> > > > > > >> > > > > > > > > simple.
> > > > > > >> > > > > > > > > > > > Have
> > > > > > >> > > > > > > > > > > > > we
> > > > > > >> > > > > > > > > > > > > > > > > considered that?
> > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > Thanks,
> > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > Jun
> > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > On Wed, Sep 28, 2022 at 6:28
> AM
> > > > Divij
> > > > > > >> Vaidya
> > > > > > >> > <
> > > > > > >> > > > > > > > > > > > > > divijvaidya13@gmail.com>
> > > > > > >> > > > > > > > > > > > > > > > > wrote:
> > > > > > >> > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > Hey folks
> > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > Does anyone else have any
> > thoughts
> > > > > on
> > > > > > >> this
> > > > > > >> > > > > before I
> > > > > > >> > > > > > > > > propose
> > > > > > >> > > > > > > > > > > > this
> > > > > > >> > > > > > > > > > > > > > for
> > > > > > >> > > > > > > > > > > > > > > a
> > > > > > >> > > > > > > > > > > > > > > > > > vote?
> > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > --
> > > > > > >> > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > On Mon, Sep 5, 2022 at 12:57
> > PM
> > > > > Satish
> > > > > > >> > > Duggana
> > > > > > >> > > > <
> > > > > > >> > > > > > > > > > > > > > > > satish.duggana@gmail.com
> > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > wrote:
> > > > > > >> > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > > Thanks for the KIP Divij!
> > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > > This is a nice improvement
> > to
> > > > > avoid
> > > > > > >> > > > > recalculation
> > > > > > >> > > > > > > of
> > > > > > >> > > > > > > > > size.
> > > > > > >> > > > > > > > > > > > > > > Customized
> > > > > > >> > > > > > > > > > > > > > > > > > RLMMs
> > > > > > >> > > > > > > > > > > > > > > > > > > can implement the best
> > possible
> > > > > > >> approach
> > > > > > >> > by
> > > > > > >> > > > > > caching
> > > > > > >> > > > > > > > or
> > > > > > >> > > > > > > > > > > > > > maintaining
> > > > > > >> > > > > > > > > > > > > > > > the
> > > > > > >> > > > > > > > > > > > > > > > > > size
> > > > > > >> > > > > > > > > > > > > > > > > > > in an efficient way. But
> > this is
> > > > > > not a
> > > > > > >> > big
> > > > > > >> > > > > > concern
> > > > > > >> > > > > > > > for
> > > > > > >> > > > > > > > > the
> > > > > > >> > > > > > > > > > > > > > default
> > > > > > >> > > > > > > > > > > > > > > > > topic
> > > > > > >> > > > > > > > > > > > > > > > > > > based RLMM as mentioned in
> > the
> > > > > KIP.
> > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > > ~Satish.
> > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > > On Wed, 13 Jul 2022 at
> > 18:48,
> > > > > Divij
> > > > > > >> > Vaidya
> > > > > > >> > > <
> > > > > > >> > > > > > > > > > > > > > > divijvaidya13@gmail.com>
> > > > > > >> > > > > > > > > > > > > > > > > > > wrote:
> > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > > > Thank you for your
> review
> > > > Luke.
> > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > > > > Reg: is that would the
> > new
> > > > > > >> > > > > > `RemoteLogSizeBytes`
> > > > > > >> > > > > > > > > metric
> > > > > > >> > > > > > > > > > > > be a
> > > > > > >> > > > > > > > > > > > > > > > > > performance
> > > > > > >> > > > > > > > > > > > > > > > > > > > overhead? Although we
> > move the
> > > > > > >> > > calculation
> > > > > > >> > > > > to a
> > > > > > >> > > > > > > > > separate
> > > > > > >> > > > > > > > > > > > API,
> > > > > > >> > > > > > > > > > > > > > we
> > > > > > >> > > > > > > > > > > > > > > > > still
> > > > > > >> > > > > > > > > > > > > > > > > > > > can't assume users will
> > > > > implement
> > > > > > a
> > > > > > >> > > > > > light-weight
> > > > > > >> > > > > > > > > method,
> > > > > > >> > > > > > > > > > > > > right?
> > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > > > This metric would be
> > logged
> > > > > using
> > > > > > >> the
> > > > > > >> > > > > > information
> > > > > > >> > > > > > > > > that is
> > > > > > >> > > > > > > > > > > > > > already
> > > > > > >> > > > > > > > > > > > > > > > > being
> > > > > > >> > > > > > > > > > > > > > > > > > > > calculated for handling
> > remote
> > > > > > >> > retention
> > > > > > >> > > > > logic,
> > > > > > >> > > > > > > > > hence, no
> > > > > > >> > > > > > > > > > > > > > > > additional
> > > > > > >> > > > > > > > > > > > > > > > > > work
> > > > > > >> > > > > > > > > > > > > > > > > > > > is required to calculate
> > this
> > > > > > >> metric.
> > > > > > >> > > More
> > > > > > >> > > > > > > > > specifically,
> > > > > > >> > > > > > > > > > > > > > whenever
> > > > > > >> > > > > > > > > > > > > > > > > > > > RemoteLogManager calls
> > > > > > >> getRemoteLogSize
> > > > > > >> > > > API,
> > > > > > >> > > > > > this
> > > > > > >> > > > > > > > > metric
> > > > > > >> > > > > > > > > > > > > would
> > > > > > >> > > > > > > > > > > > > > be
> > > > > > >> > > > > > > > > > > > > > > > > > > captured.
> > > > > > >> > > > > > > > > > > > > > > > > > > > This API call is made
> > every
> > > > time
> > > > > > >> > > > > > RemoteLogManager
> > > > > > >> > > > > > > > > wants
> > > > > > >> > > > > > > > > > > to
> > > > > > >> > > > > > > > > > > > > > handle
> > > > > > >> > > > > > > > > > > > > > > > > > expired
> > > > > > >> > > > > > > > > > > > > > > > > > > > remote log segments
> (which
> > > > > should
> > > > > > be
> > > > > > >> > > > > periodic).
> > > > > > >> > > > > > > > Does
> > > > > > >> > > > > > > > > that
> > > > > > >> > > > > > > > > > > > > > address
> > > > > > >> > > > > > > > > > > > > > > > > your
> > > > > > >> > > > > > > > > > > > > > > > > > > > concern?
> > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > > > On Tue, Jul 12, 2022 at
> > 11:01
> > > > AM
> > > > > > >> Luke
> > > > > > >> > > Chen
> > > > > > >> > > > <
> > > > > > >> > > > > > > > > > > > > showuon@gmail.com>
> > > > > > >> > > > > > > > > > > > > > > > > wrote:
> > > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > > > > Hi Divij,
> > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > > > > Thanks for the KIP!
> > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > > > > I think it makes sense
> > to
> > > > > > delegate
> > > > > > >> > the
> > > > > > >> > > > > > > > > responsibility
> > > > > > >> > > > > > > > > > > of
> > > > > > >> > > > > > > > > > > > > > > > > calculation
> > > > > > >> > > > > > > > > > > > > > > > > > to
> > > > > > >> > > > > > > > > > > > > > > > > > > > the
> > > > > > >> > > > > > > > > > > > > > > > > > > > > specific
> > > > > > RemoteLogMetadataManager
> > > > > > >> > > > > > > implementation.
> > > > > > >> > > > > > > > > > > > > > > > > > > > > But one thing I'm not
> > quite
> > > > > > sure,
> > > > > > >> is
> > > > > > >> > > that
> > > > > > >> > > > > > would
> > > > > > >> > > > > > > > > the new
> > > > > > >> > > > > > > > > > > > > > > > > > > > > `RemoteLogSizeBytes`
> > metric
> > > > > be a
> > > > > > >> > > > > performance
> > > > > > >> > > > > > > > > overhead?
> > > > > > >> > > > > > > > > > > > > > > > > > > > > Although we move the
> > > > > calculation
> > > > > > >> to a
> > > > > > >> > > > > > separate
> > > > > > >> > > > > > > > > API, we
> > > > > > >> > > > > > > > > > > > > still
> > > > > > >> > > > > > > > > > > > > > > > can't
> > > > > > >> > > > > > > > > > > > > > > > > > > assume
> > > > > > >> > > > > > > > > > > > > > > > > > > > > users will implement a
> > > > > > >> light-weight
> > > > > > >> > > > method,
> > > > > > >> > > > > > > > right?
> > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > > > > Thank you.
> > > > > > >> > > > > > > > > > > > > > > > > > > > > Luke
> > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > > > > On Fri, Jul 1, 2022 at
> > 5:47
> > > > PM
> > > > > > >> Divij
> > > > > > >> > > > > Vaidya <
> > > > > > >> > > > > > > > > > > > > > > > > divijvaidya13@gmail.com
> > > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-852%3A+Optimize+calculation+of+size+for+log+in+remote+tier
> > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > > > > > Hey folks
> > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > > > > > Please take a look
> at
> > this
> > > > > KIP
> > > > > > >> > which
> > > > > > >> > > > > > proposes
> > > > > > >> > > > > > > > an
> > > > > > >> > > > > > > > > > > > > extension
> > > > > > >> > > > > > > > > > > > > > to
> > > > > > >> > > > > > > > > > > > > > > > > > > KIP-405.
> > > > > > >> > > > > > > > > > > > > > > > > > > > > This
> > > > > > >> > > > > > > > > > > > > > > > > > > > > > is my first KIP with
> > > > Apache
> > > > > > >> Kafka
> > > > > > >> > > > > community
> > > > > > >> > > > > > > so
> > > > > > >> > > > > > > > > any
> > > > > > >> > > > > > > > > > > > > feedback
> > > > > > >> > > > > > > > > > > > > > > > would
> > > > > > >> > > > > > > > > > > > > > > > > > be
> > > > > > >> > > > > > > > > > > > > > > > > > > > > highly
> > > > > > >> > > > > > > > > > > > > > > > > > > > > > appreciated.
> > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > > > > > Cheers!
> > > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > > > > > > > > > > --
> > > > > > >> > > > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > >> > > > > > > > > > > > > > > > > > > > > > Sr. Software
> Engineer
> > > > > > >> > > > > > > > > > > > > > > > > > > > > > Amazon
> > > > > > >> > > > > > > > > > > > > > > > > > > > > >

Re: [DISCUSS] KIP-852 Optimize calculation of size for log in remote tier

Posted by Kamal Chandraprakash <ka...@gmail.com>.
Hi Divij,

Thanks for the explanation. LGTM.

--
Kamal

On Sat, Jul 1, 2023 at 7:28 AM Satish Duggana <sa...@gmail.com>
wrote:

> Hi Divij,
> I am fine with having an API to compute the size as I mentioned in my
> earlier reply in this mail thread. But I have the below comment for
> the motivation for this KIP.
>
> As you discussed offline, the main issue here is listing calls for
> remote log segment metadata is slower because of the storage used for
> RLMM. These can be avoided with this new API.
>
> Please add this in the motivation section as it is one of the main
> motivations for the KIP.
>
> Thanks,
> Satish.
>
> On Sat, 1 Jul 2023 at 01:43, Jun Rao <ju...@confluent.io.invalid> wrote:
> >
> > Hi, Divij,
> >
> > Sorry for the late reply.
> >
> > Given your explanation, the new API sounds reasonable to me. Is that
> enough
> > to build the external metadata layer for the remote segments or do you
> need
> > some additional API changes?
> >
> > Thanks,
> >
> > Jun
> >
> > On Fri, Jun 9, 2023 at 7:08 AM Divij Vaidya <di...@gmail.com>
> wrote:
> >
> > > Thank you for looking into this Kamal.
> > >
> > > You are right in saying that a cold start (i.e. leadership failover or
> > > broker startup) does not impact the broker startup duration. But it
> does
> > > have the following impact:
> > > 1. It leads to a burst of full-scan requests to RLMM in case multiple
> > > leadership failovers occur at the same time. Even if the RLMM
> > > implementation has the capability to serve the total size from an index
> > > (and hence handle this burst), we wouldn't be able to use it since the
> > > current API necessarily calls for a full scan.
> > > 2. The archival (copying of data to tiered storage) process will have a
> > > delayed start. The delayed start of archival could lead to local build
> up
> > > of data which may lead to disk full.
> > >
> > > The disadvantage of adding this new API is that every provider will
> have to
> > > implement it, agreed. But I believe that this tradeoff is worthwhile
> since
> > > the default implementation could be the same as you mentioned, i.e.
> keeping
> > > cumulative in-memory count.
> > >
> > > --
> > > Divij Vaidya
> > >
> > >
> > >
> > > On Sun, Jun 4, 2023 at 5:48 PM Kamal Chandraprakash <
> > > kamal.chandraprakash@gmail.com> wrote:
> > >
> > > > Hi Divij,
> > > >
> > > > Thanks for the KIP! Sorry for the late reply.
> > > >
> > > > Can you explain the rejected alternative-3?
> > > > Store the cumulative size of remote tier log in-memory at
> > > RemoteLogManager
> > > > "*Cons*: Every time a broker starts-up, it will scan through all the
> > > > segments in the remote tier to initialise the in-memory value. This
> would
> > > > increase the broker start-up time."
> > > >
> > > > Keeping the source of truth to determine the remote-log-size in the
> > > leader
> > > > would be consistent across different implementations of the plugin.
> The
> > > > concern posted in the KIP is that we are calculating the
> remote-log-size
> > > on
> > > > each iteration of the cleaner thread (say 5 mins). If we calculate
> only
> > > > once during broker startup or during the leadership reassignment, do
> we
> > > > still need the cache?
> > > >
> > > > The broker startup-time won't be affected by the remote log manager
> > > > > initialisation. The broker continues to accept new
> > > > produce/fetch requests, while the RLM thread in the background can
> > > > determine the remote-log-size once and start copying/deleting the
> > > segments.
> > > >
> > > > Thanks,
> > > > Kamal
> > > >
> > > > On Thu, Jun 1, 2023 at 2:08 PM Divij Vaidya <divijvaidya13@gmail.com
> >
> > > > wrote:
> > > >
> > > > > Satish / Jun
> > > > >
> > > > > Do you have any thoughts on this?
> > > > >
> > > > > --
> > > > > Divij Vaidya
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Feb 14, 2023 at 4:15 PM Divij Vaidya <
> divijvaidya13@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hey Jun
> > > > > >
> > > > > > It has been a while since this KIP got some attention. While we
> wait
> > > > for
> > > > > > Satish to chime in here, perhaps I can answer your question.
> > > > > >
> > > > > > > Could you explain how you exposed the log size in your KIP-405
> > > > > > implementation?
> > > > > >
> > > > > > > The APIs available in RLMM as per KIP-405
> > > > > > are, addRemoteLogSegmentMetadata(),
> updateRemoteLogSegmentMetadata(),
> > > > > remoteLogSegmentMetadata(), highestOffsetForEpoch(),
> > > > > putRemotePartitionDeleteMetadata(), listRemoteLogSegments(),
> > > > > onPartitionLeadershipChanges()
> > > > > > and onStopPartitions(). None of these APIs allow us to expose
> the log
> > > > > size,
> > > > > > hence, the only option that remains is to list all segments using
> > > > > > listRemoteLogSegments() and aggregate them every time we require
> to
> > > > > > calculate the size. Based on our prior discussion, this requires
> > > > reading
> > > > > > all segment metadata which won't work for non-local RLMM
> > > > implementations.
> > > > > > Satish's implementation also performs a full scan and calculates
> the
> > > > > > aggregate. see:
> > > > > >
> > > > >
> > > >
> > >
> https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/remote/RemoteLogManager.scala#L619
> > > > > >
> > > > > >
> > > > > > Does this answer your question?
> > > > > >
> > > > > > --
> > > > > > Divij Vaidya
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Dec 20, 2022 at 8:40 PM Jun Rao <jun@confluent.io.invalid
> >
> > > > > wrote:
> > > > > >
> > > > > >> Hi, Divij,
> > > > > >>
> > > > > >> Thanks for the explanation.
> > > > > >>
> > > > > >> Good question.
> > > > > >>
> > > > > >> Hi, Satish,
> > > > > >>
> > > > > >> Could you explain how you exposed the log size in your KIP-405
> > > > > >> implementation?
> > > > > >>
> > > > > >> Thanks,
> > > > > >>
> > > > > >> Jun
> > > > > >>
> > > > > >> On Tue, Dec 20, 2022 at 4:59 AM Divij Vaidya <
> > > divijvaidya13@gmail.com
> > > > >
> > > > > >> wrote:
> > > > > >>
> > > > > >> > Hey Jun
> > > > > >> >
> > > > > >> > Yes, it is possible to maintain the log size in the cache (see
> > > > > rejected
> > > > > >> > alternative#3 in the KIP) but I did not understand how it is
> > > > possible
> > > > > to
> > > > > >> > retrieve it without the new API. The log size could be
> calculated
> > > on
> > > > > >> > startup by scanning through the segments (though I would
> disagree
> > > > that
> > > > > >> this
> > > > > > >> > is the right approach since scanning itself takes on the order of
> minutes
> > > > and
> > > > > > >> > hence delays the start of the archive process), and incrementally
> > > > > maintained
> > > > > >> > afterwards, even then, we would need an API in
> > > > > RemoteLogMetadataManager
> > > > > >> so
> > > > > >> > that RLM could fetch the cached size!
> > > > > >> >
> > > > > >> > If we wish to cache the size without adding a new API, then we
> > > need
> > > > to
> > > > > >> > cache the size in RLM itself (instead of RLMM implementation)
> and
> > > > > >> > incrementally manage it. The downside of longer archive time
> at
> > > > > startup
> > > > > > >> > (due to initial scan) still remains valid in this situation.
> > > > > >> >
> > > > > >> > --
> > > > > >> > Divij Vaidya
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > On Fri, Dec 16, 2022 at 12:43 AM Jun Rao
> <jun@confluent.io.invalid
> > > >
> > > > > >> wrote:
> > > > > >> >
> > > > > >> > > Hi, Divij,
> > > > > >> > >
> > > > > >> > > Thanks for the explanation.
> > > > > >> > >
> > > > > >> > > If there is in-memory cache, could we maintain the log size
> in
> > > the
> > > > > >> cache
> > > > > >> > > with the existing API? For example, a replica could make a
> > > > > >> > > listRemoteLogSegments(TopicIdPartition topicIdPartition)
> call on
> > > > > >> startup
> > > > > >> > to
> > > > > >> > > get the remote segment size before the current leaderEpoch.
> The
> > > > > leader
> > > > > >> > > could then maintain the size incrementally afterwards. On
> leader
> > > > > >> change,
> > > > > >> > > other replicas can make a
> listRemoteLogSegments(TopicIdPartition
> > > > > >> > > topicIdPartition, int leaderEpoch) call to get the size of
> newly
> > > > > >> > generated
> > > > > >> > > segments.
> > > > > >> > >
> > > > > >> > > Thanks,
> > > > > >> > >
> > > > > >> > > Jun
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > On Wed, Dec 14, 2022 at 3:27 AM Divij Vaidya <
> > > > > divijvaidya13@gmail.com
> > > > > >> >
> > > > > >> > > wrote:
> > > > > >> > >
> > > > > >> > > > > Is the new method enough for doing size-based retention?
> > > > > >> > > >
> > > > > >> > > > Yes. You are right in assuming that this API only
> provides the
> > > > > >> Remote
> > > > > >> > > > storage size (for current epoch chain). We would use this
> API
> > > > for
> > > > > >> size
> > > > > >> > > > based retention along with a value of
> localOnlyLogSegmentSize
> > > > > which
> > > > > >> is
> > > > > >> > > > computed as
> Log.sizeInBytes(logSegments.filter(_.baseOffset >
> > > > > >> > > > highestOffsetWithRemoteIndex)). Hence, (total_log_size =
> > > > > >> > > > remoteLogSizeBytes + log.localOnlyLogSegmentSize). I have
> > > > updated
> > > > > >> the
> > > > > >> > KIP
> > > > > >> > > > with this information. You can also check an example
> > > > > implementation
> > > > > >> at
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/Log.scala#L2077
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > > > Do you imagine all accesses to remote metadata will be
> > > across
> > > > > the
> > > > > >> > > network
> > > > > >> > > > or will there be some local in-memory cache?
> > > > > >> > > >
> > > > > >> > > > I would expect a disk-less implementation to maintain a
> finite
> > > > > >> > in-memory
> > > > > >> > > > cache for segment metadata to optimize the number of
> network
> > > > calls
> > > > > >> made
> > > > > >> > > to
> > > > > >> > > > fetch the data. In future, we can think about bringing
> this
> > > > finite
> > > > > >> size
> > > > > >> > > > cache into RLM itself but that's probably a conversation
> for a
> > > > > >> > different
> > > > > >> > > > KIP. There are many other things we would like to do to
> > > optimize
> > > > > the
> > > > > >> > > Tiered
> > > > > >> > > > storage interface such as introducing a circular buffer /
> > > > > streaming
> > > > > >> > > > interface from RSM (so that we don't have to wait to
> fetch the
> > > > > >> entire
> > > > > >> > > > segment before starting to send records to the consumer),
> > > > caching
> > > > > >> the
> > > > > >> > > > segments fetched from RSM locally (I would assume all RSM
> > > plugin
> > > > > >> > > > implementations to do this, might as well add it to RLM)
> etc.
> > > > > >> > > >
> > > > > >> > > > --
> > > > > >> > > > Divij Vaidya
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > > On Mon, Dec 12, 2022 at 7:35 PM Jun Rao
> > > > <jun@confluent.io.invalid
> > > > > >
> > > > > >> > > wrote:
> > > > > >> > > >
> > > > > >> > > > > Hi, Divij,
> > > > > >> > > > >
> > > > > >> > > > > Thanks for the reply.
> > > > > >> > > > >
> > > > > >> > > > > Is the new method enough for doing size-based
> retention? It
> > > > > gives
> > > > > >> the
> > > > > >> > > > total
> > > > > >> > > > > size of the remote segments, but it seems that we still
> > > don't
> > > > > know
> > > > > >> > the
> > > > > >> > > > > exact total size for a log since there could be
> overlapping
> > > > > >> segments
> > > > > >> > > > > between the remote and the local segments.
> > > > > >> > > > >
> > > > > >> > > > > You mentioned a disk-less implementation. Do you
> imagine all
> > > > > >> accesses
> > > > > >> > > to
> > > > > >> > > > > remote metadata will be across the network or will
> there be
> > > > some
> > > > > >> > local
> > > > > >> > > > > in-memory cache?
> > > > > >> > > > >
> > > > > >> > > > > Thanks,
> > > > > >> > > > >
> > > > > >> > > > > Jun
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > > On Wed, Dec 7, 2022 at 3:10 AM Divij Vaidya <
> > > > > >> divijvaidya13@gmail.com
> > > > > >> > >
> > > > > >> > > > > wrote:
> > > > > >> > > > >
> > > > > >> > > > > > The method is needed for RLMM implementations which
> fetch
> > > > the
> > > > > >> > > > information
> > > > > >> > > > > > over the network and not for the disk based
> > > implementations
> > > > > >> (such
> > > > > >> > as
> > > > > >> > > > the
> > > > > >> > > > > > default topic based RLMM).
> > > > > >> > > > > >
> > > > > >> > > > > > I would argue that adding this API makes the interface
> > > more
> > > > > >> generic
> > > > > >> > > > than
> > > > > >> > > > > > what it is today. This is because, with the current
> APIs
> > > an
> > > > > >> > > implementor
> > > > > >> > > > > is
> > > > > >> > > > > > restricted to use disk based RLMM solutions only
> (i.e. the
> > > > > >> default
> > > > > >> > > > > > solution) whereas if we add this new API, we unblock
> usage
> > > > of
> > > > > >> > network
> > > > > >> > > > > based
> > > > > >> > > > > > RLMM implementations such as databases.
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > On Wed 30. Nov 2022 at 20:40, Jun Rao
> > > > > <jun@confluent.io.invalid
> > > > > >> >
> > > > > >> > > > wrote:
> > > > > >> > > > > >
> > > > > >> > > > > > > Hi, Divij,
> > > > > >> > > > > > >
> > > > > >> > > > > > > Thanks for the reply.
> > > > > >> > > > > > >
> > > > > >> > > > > > > Point#2. My high level question is that is the new
> > > method
> > > > > >> needed
> > > > > >> > > for
> > > > > >> > > > > > every
> > > > > >> > > > > > > implementation of remote storage or just for a
> specific
> > > > > >> > > > implementation.
> > > > > >> > > > > > The
> > > > > >> > > > > > > issues that you pointed out exist for the default
> > > > > >> implementation
> > > > > >> > of
> > > > > >> > > > > RLMM
> > > > > >> > > > > > as
> > > > > >> > > > > > > well and so far, the default implementation hasn't
> > > found a
> > > > > >> need
> > > > > >> > > for a
> > > > > >> > > > > > > similar new method. For public interface, ideally we
> > > want
> > > > to
> > > > > >> make
> > > > > >> > > it
> > > > > >> > > > > more
> > > > > >> > > > > > > general.
> > > > > >> > > > > > >
> > > > > >> > > > > > > Thanks,
> > > > > >> > > > > > >
> > > > > >> > > > > > > Jun
> > > > > >> > > > > > >
> > > > > >> > > > > > > On Mon, Nov 21, 2022 at 7:11 AM Divij Vaidya <
> > > > > >> > > > divijvaidya13@gmail.com>
> > > > > >> > > > > > > wrote:
> > > > > >> > > > > > >
> > > > > >> > > > > > > > Thank you Jun and Alex for your comments.
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > Point#1: You are right Jun. As Alex mentioned, the
> > > > > "derived
> > > > > >> > > > metadata"
> > > > > >> > > > > > can
> > > > > >> > > > > > > > increase the size of cached metadata by a factor
> of 10
> > > > but
> > > > > >> it
> > > > > >> > > > should
> > > > > >> > > > > be
> > > > > >> > > > > > > ok
> > > > > >> > > > > > > > to cache just the actual metadata. My point about
> size
> > > > > >> being a
> > > > > >> > > > > > limitation
> > > > > >> > > > > > > > for using cache is not valid anymore.
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > Point#2: For a new replica, it would still have to
> > > fetch
> > > > > the
> > > > > >> > > > metadata
> > > > > >> > > > > > > over
> > > > > >> > > > > > > > the network to initiate the warm up of the cache
> and
> > > > > hence,
> > > > > >> > > > increase
> > > > > >> > > > > > the
> > > > > >> > > > > > > > start time of the archival process. Please also
> note
> > > the
> > > > > >> > > > > repercussions
> > > > > >> > > > > > of
> > > > > >> > > > > > > > the warm up scan that Alex mentioned in this
> thread as
> > > > > part
> > > > > >> of
> > > > > >> > > > > #102.2.
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > 100#: Agreed Alex. Thanks for clarifying that. My
> > > point
> > > > > >> about
> > > > > >> > > size
> > > > > >> > > > > > being
> > > > > >> > > > > > > a
> > > > > >> > > > > > > > limitation for using cache is not valid anymore.
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > 101#: Alex, if I understand correctly, you are
> > > > suggesting
> > > > > to
> > > > > >> > > cache
> > > > > >> > > > > the
> > > > > >> > > > > > > > total size at the leader and update it on
> archival.
> > > This
> > > > > >> > wouldn't
> > > > > >> > > > > work
> > > > > >> > > > > > > for
> > > > > >> > > > > > > > cases when the leader restarts where we would
> have to
> > > > > make a
> > > > > >> > full
> > > > > >> > > > > scan
> > > > > >> > > > > > > > to update the total size entry on startup. We
> expect
> > > > users
> > > > > >> to
> > > > > >> > > store
> > > > > >> > > > > > data
> > > > > >> > > > > > > > over longer duration in remote storage which
> increases
> > > > the
> > > > > >> > > > likelihood
> > > > > >> > > > > > of
> > > > > >> > > > > > > > leader restarts / failovers.
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > 102#.1: I don't think that the current design
> > > > accommodates
> > > > > >> the
> > > > > >> > > fact
> > > > > >> > > > > > that
> > > > > >> > > > > > > > data corruption could happen at the RLMM plugin
> (we
> > > > don't
> > > > > >> have
> > > > > >> > > > > checksum
> > > > > >> > > > > > > as
> > > > > >> > > > > > > > a field in metadata as part of KIP405). If data
> > > > corruption
> > > > > >> > > occurs,
> > > > > >> > > > w/
> > > > > >> > > > > > or
> > > > > >> > > > > > > > w/o the cache, it would be a different problem to
> > > > solve. I
> > > > > >> > would
> > > > > >> > > > like
> > > > > >> > > > > > to
> > > > > >> > > > > > > > keep this outside the scope of this KIP.
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > 102#.2: Agree. This remains as the main concern
> for
> > > > using
> > > > > >> the
> > > > > >> > > cache
> > > > > >> > > > > to
> > > > > >> > > > > > > > fetch total size.
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > Regards,
> > > > > >> > > > > > > > Divij Vaidya
> > > > > >> > > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > On Fri, Nov 18, 2022 at 12:59 PM Alexandre
> Dupriez <
> > > > > >> > > > > > > > alexandre.dupriez@gmail.com> wrote:
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > > Hi Divij,
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Thanks for the KIP. Please find some comments
> based
> > > on
> > > > > >> what I
> > > > > >> > > > read
> > > > > >> > > > > on
> > > > > >> > > > > > > > > this thread so far - apologies for the repeats
> and
> > > the
> > > > > >> late
> > > > > >> > > > reply.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > If I understand correctly, one of the main
> elements
> > > of
> > > > > >> > > discussion
> > > > > >> > > > > is
> > > > > >> > > > > > > > > about caching in Kafka versus delegation of
> > > providing
> > > > > the
> > > > > >> > > remote
> > > > > >> > > > > size
> > > > > >> > > > > > > > > of a topic-partition to the plugin.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > A few comments:
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > 100. The size of the “derived metadata” which is
> > > > managed
> > > > > >> by
> > > > > >> > the
> > > > > >> > > > > > plugin
> > > > > >> > > > > > > > > to represent an rlmMetadata can indeed be close
> to 1
> > > > kB
> > > > > on
> > > > > >> > > > average
> > > > > >> > > > > > > > > depending on its own internal structure, e.g.
> the
> > > > > >> redundancy
> > > > > >> > it
> > > > > > >> > > > > > > > > enforces (unfortunately resulting in
> duplication),
> > > > > >> additional
> > > > > >> > > > > > > > > information such as checksums and primary and
> > > > secondary
> > > > > >> > > indexable
> > > > > >> > > > > > > > > keys. But indeed, the rlmMetadata is itself a
> > > lighter
> > > > > data
> > > > > >> > > > > structure
> > > > > >> > > > > > > > > by a factor of 10. And indeed, instead of
> caching
> > > the
> > > > > >> > “derived
> > > > > >> > > > > > > > > metadata”, only the rlmMetadata could be, which
> > > should
> > > > > >> > address
> > > > > >> > > > the
> > > > > >> > > > > > > > > concern regarding the memory occupancy of the
> cache.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > 101. I am not sure I fully understand why we
> would
> > > > need
> > > > > to
> > > > > >> > > cache
> > > > > >> > > > > the
> > > > > >> > > > > > > > > list of rlmMetadata to retain the remote size
> of a
> > > > > >> > > > topic-partition.
> > > > > >> > > > > > > > > Since the leader of a topic-partition is, in
> > > > > >> non-degenerated
> > > > > >> > > > cases,
> > > > > >> > > > > > > > > the only actor which can mutate the remote part
> of
> > > the
> > > > > >> > > > > > > > > topic-partition, hence its size, it could in
> theory
> > > > only
> > > > > >> > cache
> > > > > >> > > > the
> > > > > >> > > > > > > > > size of the remote log once it has calculated
> it? In
> > > > > which
> > > > > >> > case
> > > > > >> > > > > there
> > > > > >> > > > > > > > > would not be any problem regarding the size of
> the
> > > > > caching
> > > > > >> > > > > strategy.
> > > > > >> > > > > > > > > Did I miss something there?
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > 102. There may be a few challenges to consider
> with
> > > > > >> caching:
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > 102.1) As mentioned above, the caching strategy
> > > > assumes
> > > > > no
> > > > > >> > > > mutation
> > > > > >> > > > > > > > > outside the lifetime of a leader. While this is
> true
> > > > in
> > > > > >> the
> > > > > >> > > > normal
> > > > > >> > > > > > > > > course of operation, there could be accidental
> > > > mutation
> > > > > >> > outside
> > > > > >> > > > of
> > > > > >> > > > > > the
> > > > > >> > > > > > > > > leader and a loss of consistency between the
> cached
> > > > > state
> > > > > >> and
> > > > > >> > > the
> > > > > >> > > > > > > > > actual remote representation of the log. E.g.
> > > > > split-brain
> > > > > >> > > > > scenarios,
> > > > > >> > > > > > > > > bugs in the plugins, bugs in external systems
> with
> > > > > >> mutating
> > > > > >> > > > access
> > > > > >> > > > > on
> > > > > >> > > > > > > > > the derived metadata. In the worst case, a drift
> > > > between
> > > > > >> the
> > > > > >> > > > cached
> > > > > >> > > > > > > > > size and the actual size could lead to
> over-deleting
> > > > > >> remote
> > > > > >> > > data
> > > > > >> > > > > > which
> > > > > >> > > > > > > > > is a durability risk.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > The alternative you propose, by making the
> plugin
> > > the
> > > > > >> source
> > > > > >> > of
> > > > > >> > > > > truth
> > > > > > >> > > > > > > > > w.r.t. the size of the remote log, can make
> it
> > > > easier
> > > > > >> to
> > > > > >> > > avoid
> > > > > >> > > > > > > > > inconsistencies between plugin-managed metadata
> and
> > > > the
> > > > > >> > remote
> > > > > >> > > > log
> > > > > >> > > > > > > > > from the perspective of Kafka. On the other
> hand,
> > > > plugin
> > > > > >> > > vendors
> > > > > >> > > > > > would
> > > > > >> > > > > > > > > have to implement it with the expected
> efficiency to
> > > > > have
> > > > > >> it
> > > > > >> > > > yield
> > > > > >> > > > > > > > > benefits.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > 102.2) As you mentioned, the caching strategy in
> > > Kafka
> > > > > >> would
> > > > > >> > > > still
> > > > > >> > > > > > > > > require one iteration over the list of
> rlmMetadata
> > > > when
> > > > > >> the
> > > > > >> > > > > > leadership
> > > > > >> > > > > > > > > of a topic-partition is assigned to a broker,
> while
> > > > the
> > > > > >> > plugin
> > > > > >> > > > can
> > > > > >> > > > > > > > > offer alternative constant-time approaches. This
> > > > > >> calculation
> > > > > >> > > > cannot
> > > > > >> > > > > > be
> > > > > >> > > > > > > > > put on the LeaderAndIsr path and would be
> performed
> > > in
> > > > > the
> > > > > >> > > > > > background.
> > > > > >> > > > > > > > > In case of bulk leadership migration, listing
> the
> > > > > >> rlmMetadata
> > > > > >> > > > could
> > > > > >> > > > > > a)
> > > > > >> > > > > > > > > result in request bursts to any backend system
> the
> > > > > plugin
> > > > > >> may
> > > > > >> > > use
> > > > > >> > > > > > > > > [which shouldn’t be a problem for
> high-throughput
> > > data
> > > > > >> stores
> > > > > >> > > but
> > > > > >> > > > > > > > > could have cost implications] b) increase
> > > utilisation
> > > > > >> > timespan
> > > > > >> > > of
> > > > > >> > > > > the
> > > > > >> > > > > > > > > RLM threads for these calculations potentially
> > > leading
> > > > > to
> > > > > >> > > > transient
> > > > > >> > > > > > > > > starvation of tasks queued for, typically,
> > > offloading
> > > > > >> > > operations
> > > > > >> > > > c)
> > > > > >> > > > > > > > > could have a non-marginal CPU footprint on
> hardware
> > > > with
> > > > > >> > strict
> > > > > >> > > > > > > > > resource constraints. All these elements could
> have
> > > an
> > > > > >> impact
> > > > > >> > > to
> > > > > >> > > > > some
> > > > > >> > > > > > > > > degree depending on the operational environment.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > From a design perspective, one question is
> where we
> > > > want
> > > > > >> the
> > > > > >> > > > source
> > > > > >> > > > > > of
> > > > > >> > > > > > > > > truth w.r.t. remote log size to be during the
> > > lifetime
> > > > > of
> > > > > >> a
> > > > > >> > > > leader.
> > > > > >> > > > > > > > > The responsibility of maintaining a consistent
> > > > > >> representation
> > > > > >> > > of
> > > > > >> > > > > the
> > > > > >> > > > > > > > > remote log is shared by Kafka and the plugin.
> Which
> > > > > >> system is
> > > > > >> > > > best
> > > > > >> > > > > > > > > placed to maintain such a state while providing
> the
> > > > > >> highest
> > > > > >> > > > > > > > > consistency guarantees is something both Kafka
> and
> > > > > plugin
> > > > > >> > > > designers
> > > > > >> > > > > > > > > could help understand better.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Many thanks,
> > > > > >> > > > > > > > > Alexandre
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > On Thu, Nov 17, 2022 at 19:27, Jun Rao
> > > > > > >> > <jun@confluent.io.invalid
> > > > > > >> > > >
> > > > > > >> > > > > > > > wrote:
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > Hi, Divij,
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > Thanks for the reply.
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > Point #1. Is the average remote segment
> metadata
> > > > > really
> > > > > >> > 1KB?
> > > > > >> > > > > What's
> > > > > >> > > > > > > > > listed
> > > > > >> > > > > > > > > > in the public interface is probably well
> below 100
> > > > > >> bytes.
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > Point #2. I guess you are assuming that each
> > > broker
> > > > > only
> > > > > >> > > caches
> > > > > >> > > > > the
> > > > > >> > > > > > > > > remote
> > > > > >> > > > > > > > > > segment metadata in memory. An alternative
> > > approach
> > > > is
> > > > > >> to
> > > > > >> > > cache
> > > > > >> > > > > > them
> > > > > >> > > > > > > in
> > > > > >> > > > > > > > > > both memory and local disk. That way, on
> broker
> > > > > restart,
> > > > > >> > you
> > > > > >> > > > just
> > > > > >> > > > > > > need
> > > > > >> > > > > > > > to
> > > > > >> > > > > > > > > > fetch the new remote segments' metadata using
> the
> > > > > >> > > > > > > > > > listRemoteLogSegments(TopicIdPartition
> > > > > topicIdPartition,
> > > > > >> > int
> > > > > >> > > > > > > > leaderEpoch)
> > > > > >> > > > > > > > > > api. Will that work?
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > Point #3. Thanks for the explanation and it
> sounds
> > > > > good.
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > Thanks,
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > Jun
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > On Thu, Nov 17, 2022 at 7:31 AM Divij Vaidya <
> > > > > >> > > > > > > divijvaidya13@gmail.com>
> > > > > >> > > > > > > > > > wrote:
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > > Hi Jun
> > > > > >> > > > > > > > > > >
> > > > > >> > > > > > > > > > > There are three points that I would like to
> > > > present
> > > > > >> here:
> > > > > >> > > > > > > > > > >
> > > > > >> > > > > > > > > > > 1. We would require a large cache size to
> > > > > efficiently
> > > > > >> > cache
> > > > > >> > > > all
> > > > > >> > > > > > > > segment
> > > > > >> > > > > > > > > > > metadata.
> > > > > >> > > > > > > > > > > 2. Linear scan of all metadata at broker
> startup
> > > > to
> > > > > >> > > populate
> > > > > >> > > > > the
> > > > > >> > > > > > > > cache
> > > > > >> > > > > > > > > will
> > > > > >> > > > > > > > > > > be slow and will impact the archival
> process.
> > > > > >> > > > > > > > > > > 3. There is no other use case where a full
> scan
> > > of
> > > > > >> > segment
> > > > > >> > > > > > metadata
> > > > > >> > > > > > > > is
> > > > > >> > > > > > > > > > > required.
> > > > > >> > > > > > > > > > >
> > > > > >> > > > > > > > > > > Let's start by quantifying 1. Here's my
> estimate
> > > > for
> > > > > >> the
> > > > > >> > > size
> > > > > >> > > > > of
> > > > > >> > > > > > > the
> > > > > >> > > > > > > > > cache.
> > > > > >> > > > > > > > > > > Average size of segment metadata = 1KB. This
> > > could
> > > > > be
> > > > > >> > more
> > > > > >> > > if
> > > > > >> > > > > we
> > > > > >> > > > > > > have
> > > > > >> > > > > > > > > > > frequent leader failover with a large
> number of
> > > > > leader
> > > > > >> > > epochs
> > > > > >> > > > > > being
> > > > > >> > > > > > > > > stored
> > > > > >> > > > > > > > > > > per segment.
> > > > > >> > > > > > > > > > > Segment size = 100MB. Users will prefer to
> > > reduce
> > > > > the
> > > > > >> > > segment
> > > > > >> > > > > > size
> > > > > >> > > > > > > > > from the
> > > > > >> > > > > > > > > > > default value of 1GB to ensure timely
> archival
> > > of
> > > > > data
> > > > > >> > > since
> > > > > >> > > > > data
> > > > > >> > > > > > > > from
> > > > > >> > > > > > > > > > > active segment is not archived.
> > > > > >> > > > > > > > > > > Cache size = num segments * avg. segment
> > > metadata
> > > > > >> size =
> > > > > >> > > > > > > > > (100TB/100MB)*1KB
> > > > > >> > > > > > > > > > > = 1GB.
> > > > > >> > > > > > > > > > > While 1GB for cache may not sound like a
> large
> > > > > number
> > > > > >> for
> > > > > >> > > > > larger
> > > > > >> > > > > > > > > machines,
> > > > > >> > > > > > > > > > > it does eat into the memory as an additional
> > > cache
> > > > > and
> > > > > >> > > makes
> > > > > >> > > > > use
> > > > > >> > > > > > > > cases
> > > > > >> > > > > > > > > with
> > > > > > >> > > > > > > > > > > large data retention with low throughput
> > > expensive
> > > > > >> (where
> > > > > >> > > > such
> > > > > >> > > > > > use
> > > > > >> > > > > > > > case
> > > > > > >> > > > > > > > > > > could otherwise use smaller machines).
> > > > > >> > > > > > > > > > >
> > > > > >> > > > > > > > > > > About point#2:
> > > > > >> > > > > > > > > > > Even if we say that all segment metadata
> can fit
> > > > > into
> > > > > >> the
> > > > > >> > > > > cache,
> > > > > >> > > > > > we
> > > > > >> > > > > > > > > will
> > > > > >> > > > > > > > > > > need to populate the cache on broker
> startup. It
> > > > > would
> > > > > >> > not
> > > > > >> > > be
> > > > > >> > > > > in
> > > > > >> > > > > > > the
> > > > > > >> > > > > > > > > > > critical path of broker startup and hence
> won't
> > > > > >> impact
> > > > > >> > the
> > > > > >> > > > > > startup
> > > > > >> > > > > > > > > time.
> > > > > >> > > > > > > > > > > But it will impact the time when we could
> start
> > > > the
> > > > > >> > > archival
> > > > > >> > > > > > > process
> > > > > >> > > > > > > > > since
> > > > > >> > > > > > > > > > > the RLM thread pool will be blocked on the
> first
> > > > > call
> > > > > >> to
> > > > > >> > > > > > > > > > > listRemoteLogSegments(). To scan metadata
> for
> > > 1MM
> > > > > >> > segments
> > > > > >> > > > > > > (computed
> > > > > >> > > > > > > > > above)
> > > > > >> > > > > > > > > > > and transfer 1GB data over the network from
> a
> > > RLMM
> > > > > >> such
> > > > > >> > as
> > > > > >> > > a
> > > > > >> > > > > > remote
> > > > > >> > > > > > > > > > > database would be in the order of minutes
> > > > (depending
> > > > > >> on
> > > > > >> > how
> > > > > >> > > > > > > efficient
> > > > > >> > > > > > > > > the
> > > > > >> > > > > > > > > > > scan is with the RLMM implementation).
> > > Although, I
> > > > > >> would
> > > > > >> > > > > concede
> > > > > >> > > > > > > that
> > > > > >> > > > > > > > > > > having RLM threads blocked for a few
> minutes is
> > > > > >> perhaps
> > > > > >> > OK
> > > > > >> > > > but
> > > > > >> > > > > if
> > > > > >> > > > > > > we
> > > > > >> > > > > > > > > > > introduce the new API proposed in the KIP,
> we
> > > > would
> > > > > >> have
> > > > > >> > a
> > > > > >> > > > > > > > > > > deterministic startup time for RLM. Adding
> the
> > > API
> > > > > >> comes
> > > > > >> > > at a
> > > > > >> > > > > low
> > > > > >> > > > > > > > cost
> > > > > >> > > > > > > > > and
> > > > > >> > > > > > > > > > > I believe the trade off is worth it.
> > > > > >> > > > > > > > > > >
> > > > > >> > > > > > > > > > > About point#3:
> > > > > >> > > > > > > > > > > We can use
> > > listRemoteLogSegments(TopicIdPartition
> > > > > >> > > > > > topicIdPartition,
> > > > > >> > > > > > > > int
> > > > > >> > > > > > > > > > > leaderEpoch) to calculate the segments
> eligible
> > > > for
> > > > > >> > > deletion
> > > > > >> > > > > > (based
> > > > > >> > > > > > > > on
> > > > > >> > > > > > > > > size
> > > > > >> > > > > > > > > > > retention) where leader epoch(s) belong to
> the
> > > > > current
> > > > > >> > > leader
> > > > > >> > > > > > epoch
> > > > > >> > > > > > > > > chain.
> > > > > >> > > > > > > > > > > I understand that it may lead to segments
> > > > belonging
> > > > > to
> > > > > >> > > other
> > > > > >> > > > > > epoch
> > > > > >> > > > > > > > > lineage
> > > > > >> > > > > > > > > > > not getting deleted and would require a
> separate
> > > > > >> > mechanism
> > > > > >> > > to
> > > > > >> > > > > > > delete
> > > > > >> > > > > > > > > them.
> > > > > >> > > > > > > > > > > The separate mechanism would anyways be
> required
> > > > to
> > > > > >> > delete
> > > > > >> > > > > these
> > > > > >> > > > > > > > > "leaked"
> > > > > >> > > > > > > > > > > segments as there are other cases which
> could
> > > lead
> > > > > to
> > > > > >> > leaks
> > > > > >> > > > > such
> > > > > >> > > > > > as
> > > > > >> > > > > > > > > network
> > > > > > >> > > > > > > > > > > problems with RSM midway through writing a
> > > segment
> > > > > >> etc.
> > > > > >> > > > > > > > > > >
> > > > > >> > > > > > > > > > > Thank you for the replies so far. They have
> made
> > > > me
> > > > > >> > > re-think
> > > > > >> > > > my
> > > > > >> > > > > > > > > assumptions
> > > > > >> > > > > > > > > > > and this dialogue has been very
> constructive for
> > > > me.
> > > > > >> > > > > > > > > > >
> > > > > >> > > > > > > > > > > Regards,
> > > > > >> > > > > > > > > > > Divij Vaidya
> > > > > >> > > > > > > > > > >
> > > > > >> > > > > > > > > > >
> > > > > >> > > > > > > > > > >
> > > > > >> > > > > > > > > > > On Thu, Nov 10, 2022 at 10:49 PM Jun Rao
> > > > > >> > > > > > <jun@confluent.io.invalid
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > > wrote:
> > > > > >> > > > > > > > > > >
> > > > > >> > > > > > > > > > > > Hi, Divij,
> > > > > >> > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > Thanks for the reply.
> > > > > >> > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > It's true that the data in Kafka could be
> kept
> > > > > >> longer
> > > > > >> > > with
> > > > > >> > > > > > > KIP-405.
> > > > > >> > > > > > > > > How
> > > > > >> > > > > > > > > > > > much data do you envision to have per
> broker?
> > > > For
> > > > > >> 100TB
> > > > > >> > > > data
> > > > > >> > > > > > per
> > > > > >> > > > > > > > > broker,
> > > > > >> > > > > > > > > > > > with 1GB segment and segment metadata of
> 100
> > > > > bytes,
> > > > > >> it
> > > > > >> > > > > requires
> > > > > >> > > > > > > > > > > > 100TB/1GB*100 = 10MB, which should fit in
> > > > memory.
> > > > > >> > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > RemoteLogMetadataManager has two
> > > > > >> > listRemoteLogSegments()
> > > > > >> > > > > > methods.
> > > > > >> > > > > > > > > The one
> > > > > >> > > > > > > > > > > > you listed
> > > > listRemoteLogSegments(TopicIdPartition
> > > > > >> > > > > > > topicIdPartition,
> > > > > >> > > > > > > > > int
> > > > > >> > > > > > > > > > > > leaderEpoch) does return data in offset
> order.
> > > > > >> However,
> > > > > >> > > the
> > > > > >> > > > > > other
> > > > > >> > > > > > > > > > > > one listRemoteLogSegments(TopicIdPartition
> > > > > >> > > > topicIdPartition)
> > > > > >> > > > > > > > doesn't
> > > > > >> > > > > > > > > > > > specify the return order. I assume that
> you
> > > need
> > > > > the
> > > > > >> > > latter
> > > > > >> > > > > to
> > > > > >> > > > > > > > > calculate
> > > > > >> > > > > > > > > > > > the segment size?
> > > > > >> > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > Thanks,
> > > > > >> > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > Jun
> > > > > >> > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > On Thu, Nov 10, 2022 at 10:25 AM Divij
> Vaidya
> > > <
> > > > > >> > > > > > > > > divijvaidya13@gmail.com>
> > > > > >> > > > > > > > > > > > wrote:
> > > > > >> > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > *Jun,*
> > > > > >> > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > *"the default implementation of RLMM
> does
> > > > local
> > > > > >> > > caching,
> > > > > >> > > > > > > right?"*
> > > > > >> > > > > > > > > > > > > Yes, Jun. The default implementation of
> RLMM
> > > > > does
> > > > > >> > > indeed
> > > > > >> > > > > > cache
> > > > > >> > > > > > > > the
> > > > > >> > > > > > > > > > > > segment
> > > > > >> > > > > > > > > > > > > metadata today, hence, it won't work
> for use
> > > > > cases
> > > > > >> > when
> > > > > >> > > > the
> > > > > >> > > > > > > > number
> > > > > >> > > > > > > > > of
> > > > > >> > > > > > > > > > > > > segments in remote storage is large
> enough
> > > to
> > > > > >> exceed
> > > > > >> > > the
> > > > > >> > > > > size
> > > > > >> > > > > > > of
> > > > > >> > > > > > > > > cache.
> > > > > >> > > > > > > > > > > > As
> > > > > >> > > > > > > > > > > > > part of this KIP, I will implement the
> new
> > > > > >> proposed
> > > > > >> > API
> > > > > >> > > > in
> > > > > >> > > > > > the
> > > > > >> > > > > > > > > default
> > > > > >> > > > > > > > > > > > > implementation of RLMM but the
> underlying
> > > > > >> > > implementation
> > > > > >> > > > > will
> > > > > >> > > > > > > > > still be
> > > > > >> > > > > > > > > > > a
> > > > > >> > > > > > > > > > > > > scan. I will pick up optimizing that in
> a
> > > > > separate
> > > > > >> > PR.
> > > > > >> > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > *"we also cache all segment metadata in
> the
> > > > > >> brokers
> > > > > >> > > > without
> > > > > >> > > > > > > > > KIP-405. Do
> > > > > >> > > > > > > > > > > > you
> > > > > >> > > > > > > > > > > > > see a need to change that?"*
> > > > > >> > > > > > > > > > > > > Please correct me if I am wrong here
> but we
> > > > > cache
> > > > > >> > > > metadata
> > > > > >> > > > > > for
> > > > > >> > > > > > > > > segments
> > > > > >> > > > > > > > > > > > > "residing in local storage". The size
> of the
> > > > > >> current
> > > > > >> > > > cache
> > > > > >> > > > > > > works
> > > > > >> > > > > > > > > fine
> > > > > >> > > > > > > > > > > for
> > > > > >> > > > > > > > > > > > > the scale of the number of segments
> that we
> > > > > >> expect to
> > > > > >> > > > store
> > > > > >> > > > > > in
> > > > > >> > > > > > > > > local
> > > > > >> > > > > > > > > > > > > storage. After KIP-405, that cache will
> > > > continue
> > > > > >> to
> > > > > >> > > store
> > > > > >> > > > > > > > metadata
> > > > > >> > > > > > > > > for
> > > > > >> > > > > > > > > > > > > segments which are residing in local
> storage
> > > > and
> > > > > >> > hence,
> > > > > >> > > > we
> > > > > >> > > > > > > don't
> > > > > >> > > > > > > > > need
> > > > > >> > > > > > > > > > > to
> > > > > >> > > > > > > > > > > > > change that. For segments which have
> been
> > > > > >> offloaded
> > > > > >> > to
> > > > > >> > > > > remote
> > > > > >> > > > > > > > > storage,
> > > > > >> > > > > > > > > > > it
> > > > > >> > > > > > > > > > > > > would rely on RLMM. Note that the scale
> of
> > > > data
> > > > > >> > stored
> > > > > >> > > in
> > > > > >> > > > > > RLMM
> > > > > >> > > > > > > is
> > > > > >> > > > > > > > > > > > different
> > > > > >> > > > > > > > > > > > > from local cache because the number of
> > > > segments
> > > > > is
> > > > > >> > > > expected
> > > > > >> > > > > > to
> > > > > >> > > > > > > be
> > > > > >> > > > > > > > > much
> > > > > >> > > > > > > > > > > > > larger than what current implementation
> > > stores
> > > > > in
> > > > > >> > local
> > > > > >> > > > > > > storage.
> > > > > >> > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > 2,3,4:
> > > > > >> > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > >> > > > > does
> > > > > >> > > > > > > > > specify
> > > > > >> > > > > > > > > > > the
> > > > > >> > > > > > > > > > > > > order i.e. it returns the segments
> sorted by
> > > > > first
> > > > > >> > > offset
> > > > > >> > > > > in
> > > > > >> > > > > > > > > ascending
> > > > > >> > > > > > > > > > > > > order. I am copying the API docs for
> KIP-405
> > > > > here
> > > > > >> for
> > > > > >> > > > your
> > > > > >> > > > > > > > > reference
> > > > > >> > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > *Returns iterator of remote log segment metadata, sorted by {@link
> > > > > > >> > > > > > > > > > > > > RemoteLogSegmentMetadata#startOffset()} in ascending order which
> > > > > > >> > > > > > > > > > > > > contains the given leader epoch. This is used by remote log retention
> > > > > > >> > > > > > > > > > > > > management subsystem to fetch the segment metadata for a given leader
> > > > > > >> > > > > > > > > > > > > epoch.*
> > > > > > >> > > > > > > > > > > > > *@param topicIdPartition topic partition*
> > > > > > >> > > > > > > > > > > > > *@param leaderEpoch leader epoch*
> > > > > > >> > > > > > > > > > > > > *@return Iterator of remote segments, sorted by start offset in
> > > > > > >> > > > > > > > > > > > > ascending order.*
> > > > > >> > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > *Luke,*
> > > > > >> > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > 5. Note that we are trying to optimize
> the
> > > > > >> efficiency
> > > > > >> > > of
> > > > > >> > > > > size
> > > > > >> > > > > > > > based
> > > > > >> > > > > > > > > > > > > retention for remote storage. KIP-405
> does
> > > not
> > > > > >> > > introduce
> > > > > >> > > > a
> > > > > >> > > > > > new
> > > > > >> > > > > > > > > config
> > > > > >> > > > > > > > > > > for
> > > > > >> > > > > > > > > > > > > periodically checking remote similar to
> > > > > >> > > > > > > > > > > log.retention.check.interval.ms
> > > > > >> > > > > > > > > > > > > which is applicable for remote storage.
> > > Hence,
> > > > > the
> > > > > >> > > metric
> > > > > >> > > > > > will
> > > > > >> > > > > > > be
> > > > > >> > > > > > > > > > > updated
> > > > > >> > > > > > > > > > > > > at the time of invoking log retention
> check
> > > > for
> > > > > >> > remote
> > > > > >> > > > tier
> > > > > >> > > > > > > which
> > > > > >> > > > > > > > > is
> > > > > >> > > > > > > > > > > > > pending implementation today. We can
> perhaps
> > > > > come
> > > > > >> > back
> > > > > >> > > > and
> > > > > >> > > > > > > update
> > > > > >> > > > > > > > > the
> > > > > >> > > > > > > > > > > > > metric description after the
> implementation
> > > of
> > > > > log
> > > > > >> > > > > retention
> > > > > >> > > > > > > > check
> > > > > >> > > > > > > > > in
> > > > > >> > > > > > > > > > > > > RemoteLogManager.
> > > > > >> > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > --
> > > > > >> > > > > > > > > > > > > Divij Vaidya
> > > > > >> > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > On Thu, Nov 10, 2022 at 6:16 AM Luke
> Chen <
> > > > > >> > > > > showuon@gmail.com
> > > > > >> > > > > > >
> > > > > >> > > > > > > > > wrote:
> > > > > >> > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > Hi Divij,
> > > > > >> > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > One more question about the metric:
> > > > > >> > > > > > > > > > > > > > I think the metric will be updated
> when
> > > > > >> > > > > > > > > > > > > > (1) each time we run the log retention
> > > check
> > > > > >> (that
> > > > > >> > > is,
> > > > > >> > > > > > > > > > > > > > log.retention.check.interval.ms)
> > > > > >> > > > > > > > > > > > > > (2) When user explicitly call
> > > > getRemoteLogSize
> > > > > >> > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > Is that correct?
> > > > > >> > > > > > > > > > > > > > Maybe we should add a note in metric
> > > > > >> description,
> > > > > >> > > > > > otherwise,
> > > > > >> > > > > > > > when
> > > > > >> > > > > > > > > > > user
> > > > > >> > > > > > > > > > > > > got,
> > > > > >> > > > > > > > > > > > > > let's say 0 of RemoteLogSizeBytes,
> will be
> > > > > >> > surprised.
> > > > > >> > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > Otherwise, LGTM
> > > > > >> > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > Thank you for the KIP
> > > > > >> > > > > > > > > > > > > > Luke
> > > > > >> > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > On Thu, Nov 10, 2022 at 2:55 AM Jun
> Rao
> > > > > >> > > > > > > > <jun@confluent.io.invalid
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > > > > wrote:
> > > > > >> > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > Hi, Divij,
> > > > > >> > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > Thanks for the explanation.
> > > > > >> > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > 1. Hmm, the default implementation
> of
> > > RLMM
> > > > > >> does
> > > > > >> > > local
> > > > > >> > > > > > > > caching,
> > > > > >> > > > > > > > > > > right?
> > > > > >> > > > > > > > > > > > > > > Currently, we also cache all segment
> > > > > metadata
> > > > > >> in
> > > > > >> > > the
> > > > > >> > > > > > > brokers
> > > > > >> > > > > > > > > > > without
> > > > > >> > > > > > > > > > > > > > > KIP-405. Do you see a need to change
> > > that?
> > > > > >> > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > 2,3,4: Yes, your explanation makes
> > > sense.
> > > > > >> > However,
> > > > > >> > > > > > > > > > > > > > > currently,
> > > > > >> > > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > >> > > > > > > > > doesn't
> > > > > >> > > > > > > > > > > > > > specify
> > > > > >> > > > > > > > > > > > > > > a particular order of the iterator.
> Do
> > > you
> > > > > >> intend
> > > > > >> > > to
> > > > > >> > > > > > change
> > > > > >> > > > > > > > > that?
> > > > > >> > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > Thanks,
> > > > > >> > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > Jun
> > > > > >> > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 3:31 AM Divij
> > > > Vaidya
> > > > > <
> > > > > >> > > > > > > > > > > divijvaidya13@gmail.com
> > > > > >> > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > wrote:
> > > > > >> > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > Hey Jun
> > > > > >> > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > Thank you for your comments.
> > > > > >> > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > *1. "RLMM implementor could ensure
> > > that
> > > > > >> > > > > > > > > listRemoteLogSegments()
> > > > > >> > > > > > > > > > > is
> > > > > >> > > > > > > > > > > > > > fast"*
> > > > > >> > > > > > > > > > > > > > > > This would be ideal but
> pragmatically,
> > > > it
> > > > > is
> > > > > >> > > > > difficult
> > > > > >> > > > > > to
> > > > > >> > > > > > > > > ensure
> > > > > >> > > > > > > > > > > > that
> > > > > >> > > > > > > > > > > > > > > > listRemoteLogSegments() is fast.
> This
> > > is
> > > > > >> > because
> > > > > >> > > of
> > > > > >> > > > > the
> > > > > >> > > > > > > > > > > possibility
> > > > > >> > > > > > > > > > > > > of
> > > > > >> > > > > > > > > > > > > > a
> > > > > >> > > > > > > > > > > > > > > > large number of segments (much
> larger
> > > > than
> > > > > >> what
> > > > > >> > > > Kafka
> > > > > >> > > > > > > > > currently
> > > > > >> > > > > > > > > > > > > handles
> > > > > >> > > > > > > > > > > > > > > > with local storage today) would
> make
> > > it
> > > > > >> > > infeasible
> > > > > >> > > > to
> > > > > >> > > > > > > adopt
> > > > > >> > > > > > > > > > > > > strategies
> > > > > >> > > > > > > > > > > > > > > such
> > > > > >> > > > > > > > > > > > > > > > as local caching to improve the
> > > > > performance
> > > > > >> of
> > > > > >> > > > > > > > > > > > listRemoteLogSegments.
> > > > > >> > > > > > > > > > > > > > > Apart
> > > > > >> > > > > > > > > > > > > > > > from caching (which won't work
> due to
> > > > size
> > > > > >> > > > > > limitations) I
> > > > > >> > > > > > > > > can't
> > > > > >> > > > > > > > > > > > think
> > > > > >> > > > > > > > > > > > > > of
> > > > > >> > > > > > > > > > > > > > > > other strategies which may
> eliminate
> > > the
> > > > > >> need
> > > > > >> > for
> > > > > >> > > > IO
> > > > > >> > > > > > > > > > > > > > > > operations proportional to the
> number
> > > of
> > > > > >> total
> > > > > >> > > > > > segments.
> > > > > >> > > > > > > > > Please
> > > > > >> > > > > > > > > > > > > advise
> > > > > >> > > > > > > > > > > > > > if
> > > > > >> > > > > > > > > > > > > > > > you have something in mind.
> > > > > >> > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > 2.  "*If the size exceeds the
> > > retention
> > > > > >> size,
> > > > > >> > we
> > > > > >> > > > need
> > > > > >> > > > > > to
> > > > > >> > > > > > > > > > > determine
> > > > > >> > > > > > > > > > > > > the
> > > > > >> > > > > > > > > > > > > > > > subset of segments to delete to
> bring
> > > > the
> > > > > >> size
> > > > > >> > > > within
> > > > > >> > > > > > the
> > > > > >> > > > > > > > > > > retention
> > > > > >> > > > > > > > > > > > > > > limit.
> > > > > >> > > > > > > > > > > > > > > > Do we need to call
> > > > > >> > > > > > > > > > >
> RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > >> > > > > > > > > > > > > to
> > > > > >> > > > > > > > > > > > > > > > determine that?"*
> > > > > >> > > > > > > > > > > > > > > > Yes, we need to call
> > > > > >> listRemoteLogSegments() to
> > > > > >> > > > > > determine
> > > > > >> > > > > > > > > which
> > > > > >> > > > > > > > > > > > > > segments
> > > > > >> > > > > > > > > > > > > > > > should be deleted. But there is a
> > > > > difference
> > > > > >> > with
> > > > > >> > > > the
> > > > > >> > > > > > use
> > > > > >> > > > > > > > > case we
> > > > > >> > > > > > > > > > > > are
> > > > > >> > > > > > > > > > > > > > > > trying to optimize with this KIP.
> To
> > > > > >> determine
> > > > > >> > > the
> > > > > >> > > > > > subset
> > > > > >> > > > > > > > of
> > > > > >> > > > > > > > > > > > segments
> > > > > >> > > > > > > > > > > > > > > which
> > > > > >> > > > > > > > > > > > > > > > would be deleted, we only read
> > > metadata
> > > > > for
> > > > > >> > > > segments
> > > > > >> > > > > > > which
> > > > > >> > > > > > > > > would
> > > > > >> > > > > > > > > > > be
> > > > > >> > > > > > > > > > > > > > > deleted
> > > > > >> > > > > > > > > > > > > > > > via the listRemoteLogSegments().
> But
> > > to
> > > > > >> > determine
> > > > > >> > > > the
> > > > > >> > > > > > > > > > > totalLogSize,
> > > > > >> > > > > > > > > > > > > > which
> > > > > >> > > > > > > > > > > > > > > > is required every time retention
> logic
> > > > > >> based on
> > > > > >> > > > size
> > > > > >> > > > > > > > > executes, we
> > > > > >> > > > > > > > > > > > > read
> > > > > >> > > > > > > > > > > > > > > > metadata of *all* the segments in
> > > remote
> > > > > >> > storage.
> > > > > >> > > > > > Hence,
> > > > > >> > > > > > > > the
> > > > > >> > > > > > > > > > > number
> > > > > >> > > > > > > > > > > > > of
> > > > > >> > > > > > > > > > > > > > > > results returned by
> > > > > >> > > > > > > > > > > >
> > > > *RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > >> > > > > > > > > > > > > > *is
> > > > > >> > > > > > > > > > > > > > > > different when we are calculating
> > > > > >> totalLogSize
> > > > > >> > > vs.
> > > > > >> > > > > when
> > > > > >> > > > > > > we
> > > > > >> > > > > > > > > are
> > > > > >> > > > > > > > > > > > > > > determining
> > > > > >> > > > > > > > > > > > > > > > the subset of segments to delete.
> > > > > >> > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > 3.
> > > > > >> > > > > > > > > > > > > > > > *"Also, what about time-based
> > > retention?
> > > > > To
> > > > > >> > make
> > > > > >> > > > that
> > > > > >> > > > > > > > > efficient,
> > > > > >> > > > > > > > > > > do
> > > > > >> > > > > > > > > > > > > we
> > > > > >> > > > > > > > > > > > > > > need
> > > > > >> > > > > > > > > > > > > > > > to make some additional interface
> > > > > >> changes?"*No.
> > > > > >> > > > Note
> > > > > >> > > > > > that
> > > > > >> > > > > > > > > time
> > > > > >> > > > > > > > > > > > > > complexity
> > > > > >> > > > > > > > > > > > > > > > to determine the segments for
> > > retention
> > > > is
> > > > > >> > > > different
> > > > > >> > > > > > for
> > > > > >> > > > > > > > time
> > > > > >> > > > > > > > > > > based
> > > > > >> > > > > > > > > > > > > vs.
> > > > > >> > > > > > > > > > > > > > > > size based. For time based, the
> time
> > > > > >> complexity
> > > > > >> > > is
> > > > > >> > > > a
> > > > > >> > > > > > > > > function of
> > > > > >> > > > > > > > > > > > the
> > > > > >> > > > > > > > > > > > > > > number
> > > > > >> > > > > > > > > > > > > > > > of segments which are "eligible
> for
> > > > > >> deletion"
> > > > > >> > > > (since
> > > > > >> > > > > we
> > > > > >> > > > > > > > only
> > > > > >> > > > > > > > > read
> > > > > >> > > > > > > > > > > > > > > metadata
> > > > > >> > > > > > > > > > > > > > > > for segments which would be
> deleted)
> > > > > >> whereas in
> > > > > >> > > > size
> > > > > >> > > > > > > based
> > > > > >> > > > > > > > > > > > retention,
> > > > > >> > > > > > > > > > > > > > the
> > > > > >> > > > > > > > > > > > > > > > time complexity is a function of
> "all
> > > > > >> segments"
> > > > > >> > > > > > available
> > > > > >> > > > > > > > in
> > > > > >> > > > > > > > > > > remote
> > > > > >> > > > > > > > > > > > > > > storage
> > > > > >> > > > > > > > > > > > > > > > (metadata of all segments needs
> to be
> > > > read
> > > > > >> to
> > > > > >> > > > > calculate
> > > > > >> > > > > > > the
> > > > > >> > > > > > > > > total
> > > > > >> > > > > > > > > > > > > > size).
> > > > > >> > > > > > > > > > > > > > > As
> > > > > >> > > > > > > > > > > > > > > > you may observe, this KIP will
> bring
> > > the
> > > > > >> time
> > > > > >> > > > > > complexity
> > > > > >> > > > > > > > for
> > > > > >> > > > > > > > > both
> > > > > >> > > > > > > > > > > > > time
> > > > > >> > > > > > > > > > > > > > > > based retention & size based
> retention
> > > > to
> > > > > >> the
> > > > > >> > > same
> > > > > >> > > > > > > > function.
> > > > > >> > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > 4. Also, please note that this
> new API
> > > > > >> > introduced
> > > > > >> > > > in
> > > > > >> > > > > > this
> > > > > >> > > > > > > > KIP
> > > > > >> > > > > > > > > > > also
> > > > > >> > > > > > > > > > > > > > > enables
> > > > > >> > > > > > > > > > > > > > > > us to provide a metric for total
> size
> > > of
> > > > > >> data
> > > > > >> > > > stored
> > > > > >> > > > > in
> > > > > >> > > > > > > > > remote
> > > > > >> > > > > > > > > > > > > storage.
> > > > > >> > > > > > > > > > > > > > > > Without the API, calculation of
> this
> > > > > metric
> > > > > >> > will
> > > > > >> > > > > become
> > > > > >> > > > > > > > very
> > > > > >> > > > > > > > > > > > > expensive
> > > > > >> > > > > > > > > > > > > > > with
> > > > > >> > > > > > > > > > > > > > > > *listRemoteLogSegments().*
> > > > > >> > > > > > > > > > > > > > > > I understand that your motivation
> here
> > > > is
> > > > > to
> > > > > >> > > avoid
> > > > > >> > > > > > > > polluting
> > > > > >> > > > > > > > > the
> > > > > >> > > > > > > > > > > > > > > interface
> > > > > >> > > > > > > > > > > > > > > > with optimization specific APIs
> and I
> > > > will
> > > > > >> > agree
> > > > > >> > > > with
> > > > > >> > > > > > > that
> > > > > >> > > > > > > > > goal.
> > > > > >> > > > > > > > > > > > But
> > > > > >> > > > > > > > > > > > > I
> > > > > >> > > > > > > > > > > > > > > > believe that this new API
> proposed in
> > > > the
> > > > > >> KIP
> > > > > >> > > > brings
> > > > > >> > > > > in
> > > > > >> > > > > > > > > > > significant
> > > > > >> > > > > > > > > > > > > > > > improvement and there is no other
> work
> > > > > >> around
> > > > > >> > > > > available
> > > > > >> > > > > > > to
> > > > > >> > > > > > > > > > > achieve
> > > > > >> > > > > > > > > > > > > the
> > > > > >> > > > > > > > > > > > > > > same
> > > > > >> > > > > > > > > > > > > > > > performance.
> > > > > >> > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > Regards,
> > > > > >> > > > > > > > > > > > > > > > Divij Vaidya
> > > > > >> > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 12:12 AM
> Jun
> > > Rao
> > > > > >> > > > > > > > > <jun@confluent.io.invalid
> > > > > >> > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > wrote:
> > > > > >> > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > Hi, Divij,
> > > > > >> > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > Thanks for the KIP. Sorry for
> the
> > > late
> > > > > >> reply.
> > > > > >> > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > The motivation of the KIP is to
> > > > improve
> > > > > >> the
> > > > > >> > > > > > efficiency
> > > > > >> > > > > > > of
> > > > > >> > > > > > > > > size
> > > > > >> > > > > > > > > > > > > based
> > > > > >> > > > > > > > > > > > > > > > > retention. I am not sure the
> > > proposed
> > > > > >> changes
> > > > > >> > > are
> > > > > >> > > > > > > enough.
> > > > > >> > > > > > > > > For
> > > > > >> > > > > > > > > > > > > > example,
> > > > > >> > > > > > > > > > > > > > > if
> > > > > >> > > > > > > > > > > > > > > > > the size exceeds the retention
> size,
> > > > we
> > > > > >> need
> > > > > >> > to
> > > > > >> > > > > > > determine
> > > > > >> > > > > > > > > the
> > > > > >> > > > > > > > > > > > > subset
> > > > > >> > > > > > > > > > > > > > of
> > > > > >> > > > > > > > > > > > > > > > > segments to delete to bring the
> size
> > > > > >> within
> > > > > >> > the
> > > > > >> > > > > > > retention
> > > > > >> > > > > > > > > > > limit.
> > > > > >> > > > > > > > > > > > Do
> > > > > >> > > > > > > > > > > > > > we
> > > > > >> > > > > > > > > > > > > > > > need
> > > > > >> > > > > > > > > > > > > > > > > to call
> > > > > >> > > > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > >> > > > > > > > to
> > > > > >> > > > > > > > > > > > > determine
> > > > > >> > > > > > > > > > > > > > > > that?
> > > > > >> > > > > > > > > > > > > > > > > Also, what about time-based
> > > retention?
> > > > > To
> > > > > >> > make
> > > > > >> > > > that
> > > > > >> > > > > > > > > efficient,
> > > > > >> > > > > > > > > > > do
> > > > > >> > > > > > > > > > > > > we
> > > > > >> > > > > > > > > > > > > > > need
> > > > > >> > > > > > > > > > > > > > > > > to make some additional
> interface
> > > > > changes?
> > > > > >> > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > An alternative approach is for
> the
> > > > RLMM
> > > > > >> > > > implementor
> > > > > >> > > > > > to
> > > > > >> > > > > > > > make
> > > > > >> > > > > > > > > > > sure
> > > > > >> > > > > > > > > > > > > > > > > that
> > > > > >> > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > >> > > > > > > is
> > > > > >> > > > > > > > > fast
> > > > > >> > > > > > > > > > > > > (e.g.,
> > > > > >> > > > > > > > > > > > > > > with
> > > > > >> > > > > > > > > > > > > > > > > local caching). This way, we
> could
> > > > keep
> > > > > >> the
> > > > > >> > > > > interface
> > > > > >> > > > > > > > > simple.
> > > > > >> > > > > > > > > > > > Have
> > > > > >> > > > > > > > > > > > > we
> > > > > >> > > > > > > > > > > > > > > > > considered that?
> > > > > >> > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > Thanks,
> > > > > >> > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > Jun
> > > > > >> > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > On Wed, Sep 28, 2022 at 6:28 AM
> > > Divij
> > > > > >> Vaidya
> > > > > >> > <
> > > > > >> > > > > > > > > > > > > > divijvaidya13@gmail.com>
> > > > > >> > > > > > > > > > > > > > > > > wrote:
> > > > > >> > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > Hey folks
> > > > > >> > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > Does anyone else have any
> thoughts
> > > > on
> > > > > >> this
> > > > > >> > > > > before I
> > > > > >> > > > > > > > > propose
> > > > > >> > > > > > > > > > > > this
> > > > > >> > > > > > > > > > > > > > for
> > > > > >> > > > > > > > > > > > > > > a
> > > > > >> > > > > > > > > > > > > > > > > > vote?
> > > > > >> > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > --
> > > > > >> > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > >> > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > On Mon, Sep 5, 2022 at 12:57
> PM
> > > > Satish
> > > > > >> > > Duggana
> > > > > >> > > > <
> > > > > >> > > > > > > > > > > > > > > > satish.duggana@gmail.com
> > > > > >> > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > wrote:
> > > > > >> > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > Thanks for the KIP Divij!
> > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > This is a nice improvement
> to
> > > > avoid
> > > > > >> > > > > recalculation
> > > > > >> > > > > > > of
> > > > > >> > > > > > > > > size.
> > > > > >> > > > > > > > > > > > > > > Customized
> > > > > >> > > > > > > > > > > > > > > > > > RLMMs
> > > > > >> > > > > > > > > > > > > > > > > > > can implement the best
> possible
> > > > > >> approach
> > > > > >> > by
> > > > > >> > > > > > caching
> > > > > >> > > > > > > > or
> > > > > >> > > > > > > > > > > > > > maintaining
> > > > > >> > > > > > > > > > > > > > > > the
> > > > > >> > > > > > > > > > > > > > > > > > size
> > > > > >> > > > > > > > > > > > > > > > > > > in an efficient way. But
> this is
> > > > > not a
> > > > > >> > big
> > > > > >> > > > > > concern
> > > > > >> > > > > > > > for
> > > > > >> > > > > > > > > the
> > > > > >> > > > > > > > > > > > > > default
> > > > > >> > > > > > > > > > > > > > > > > topic
> > > > > >> > > > > > > > > > > > > > > > > > > based RLMM as mentioned in
> the
> > > > KIP.
> > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > ~Satish.
> > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > On Wed, 13 Jul 2022 at
> 18:48,
> > > > Divij
> > > > > >> > Vaidya
> > > > > >> > > <
> > > > > >> > > > > > > > > > > > > > > divijvaidya13@gmail.com>
> > > > > >> > > > > > > > > > > > > > > > > > > wrote:
> > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > > Thank you for your review
> > > Luke.
> > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > > > Reg: is that would the
> new
> > > > > >> > > > > > `RemoteLogSizeBytes`
> > > > > >> > > > > > > > > metric
> > > > > >> > > > > > > > > > > > be a
> > > > > >> > > > > > > > > > > > > > > > > > performance
> > > > > >> > > > > > > > > > > > > > > > > > > > overhead? Although we
> move the
> > > > > >> > > calculation
> > > > > >> > > > > to a
> > > > > >> > > > > > > > > separate
> > > > > >> > > > > > > > > > > > API,
> > > > > >> > > > > > > > > > > > > > we
> > > > > >> > > > > > > > > > > > > > > > > still
> > > > > >> > > > > > > > > > > > > > > > > > > > can't assume users will
> > > > implement
> > > > > a
> > > > > >> > > > > > light-weight
> > > > > >> > > > > > > > > method,
> > > > > >> > > > > > > > > > > > > right?
> > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > > This metric would be
> logged
> > > > using
> > > > > >> the
> > > > > >> > > > > > information
> > > > > >> > > > > > > > > that is
> > > > > >> > > > > > > > > > > > > > already
> > > > > >> > > > > > > > > > > > > > > > > being
> > > > > >> > > > > > > > > > > > > > > > > > > > calculated for handling
> remote
> > > > > >> > retention
> > > > > >> > > > > logic,
> > > > > >> > > > > > > > > hence, no
> > > > > >> > > > > > > > > > > > > > > > additional
> > > > > >> > > > > > > > > > > > > > > > > > work
> > > > > >> > > > > > > > > > > > > > > > > > > > is required to calculate
> this
> > > > > >> metric.
> > > > > >> > > More
> > > > > >> > > > > > > > > specifically,
> > > > > >> > > > > > > > > > > > > > whenever
> > > > > >> > > > > > > > > > > > > > > > > > > > RemoteLogManager calls
> > > > > >> getRemoteLogSize
> > > > > >> > > > API,
> > > > > >> > > > > > this
> > > > > >> > > > > > > > > metric
> > > > > >> > > > > > > > > > > > > would
> > > > > >> > > > > > > > > > > > > > be
> > > > > >> > > > > > > > > > > > > > > > > > > captured.
> > > > > >> > > > > > > > > > > > > > > > > > > > This API call is made
> every
> > > time
> > > > > >> > > > > > RemoteLogManager
> > > > > >> > > > > > > > > wants
> > > > > >> > > > > > > > > > > to
> > > > > >> > > > > > > > > > > > > > handle
> > > > > >> > > > > > > > > > > > > > > > > > expired
> > > > > >> > > > > > > > > > > > > > > > > > > > remote log segments (which
> > > > should
> > > > > be
> > > > > >> > > > > periodic).
> > > > > >> > > > > > > > Does
> > > > > >> > > > > > > > > that
> > > > > >> > > > > > > > > > > > > > address
> > > > > >> > > > > > > > > > > > > > > > > your
> > > > > >> > > > > > > > > > > > > > > > > > > > concern?
> > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > > On Tue, Jul 12, 2022 at
> 11:01
> > > AM
> > > > > >> Luke
> > > > > >> > > Chen
> > > > > >> > > > <
> > > > > >> > > > > > > > > > > > > showuon@gmail.com>
> > > > > >> > > > > > > > > > > > > > > > > wrote:
> > > > > >> > > > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > > > Hi Divij,
> > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > > > Thanks for the KIP!
> > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > > > I think it makes sense
> to
> > > > > delegate
> > > > > >> > the
> > > > > >> > > > > > > > > responsibility
> > > > > >> > > > > > > > > > > of
> > > > > >> > > > > > > > > > > > > > > > > calculation
> > > > > >> > > > > > > > > > > > > > > > > > to
> > > > > >> > > > > > > > > > > > > > > > > > > > the
> > > > > >> > > > > > > > > > > > > > > > > > > > > specific
> > > > > RemoteLogMetadataManager
> > > > > >> > > > > > > implementation.
> > > > > >> > > > > > > > > > > > > > > > > > > > > But one thing I'm not
> quite
> > > > > sure,
> > > > > >> is
> > > > > >> > > that
> > > > > >> > > > > > would
> > > > > >> > > > > > > > > the new
> > > > > >> > > > > > > > > > > > > > > > > > > > > `RemoteLogSizeBytes`
> metric
> > > > be a
> > > > > >> > > > > performance
> > > > > >> > > > > > > > > overhead?
> > > > > >> > > > > > > > > > > > > > > > > > > > > Although we move the
> > > > calculation
> > > > > >> to a
> > > > > >> > > > > > separate
> > > > > >> > > > > > > > > API, we
> > > > > >> > > > > > > > > > > > > still
> > > > > >> > > > > > > > > > > > > > > > can't
> > > > > >> > > > > > > > > > > > > > > > > > > assume
> > > > > >> > > > > > > > > > > > > > > > > > > > > users will implement a
> > > > > >> light-weight
> > > > > >> > > > method,
> > > > > >> > > > > > > > right?
> > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > > > Thank you.
> > > > > >> > > > > > > > > > > > > > > > > > > > > Luke
> > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > > > On Fri, Jul 1, 2022 at
> 5:47
> > > PM
> > > > > >> Divij
> > > > > >> > > > > Vaidya <
> > > > > >> > > > > > > > > > > > > > > > > divijvaidya13@gmail.com
> > > > > >> > > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-852%3A+Optimize+calculation+of+size+for+log+in+remote+tier
> > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > > > > Hey folks
> > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > > > > Please take a look at
> this
> > > > KIP
> > > > > >> > which
> > > > > >> > > > > > proposes
> > > > > >> > > > > > > > an
> > > > > >> > > > > > > > > > > > > extension
> > > > > >> > > > > > > > > > > > > > to
> > > > > >> > > > > > > > > > > > > > > > > > > KIP-405.
> > > > > >> > > > > > > > > > > > > > > > > > > > > This
> > > > > >> > > > > > > > > > > > > > > > > > > > > > is my first KIP with
> > > Apache
> > > > > >> Kafka
> > > > > >> > > > > community
> > > > > >> > > > > > > so
> > > > > >> > > > > > > > > any
> > > > > >> > > > > > > > > > > > > feedback
> > > > > >> > > > > > > > > > > > > > > > would
> > > > > >> > > > > > > > > > > > > > > > > > be
> > > > > >> > > > > > > > > > > > > > > > > > > > > highly
> > > > > >> > > > > > > > > > > > > > > > > > > > > > appreciated.
> > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > > > > Cheers!
> > > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > > > > > > > > > > --
> > > > > >> > > > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > >> > > > > > > > > > > > > > > > > > > > > > Sr. Software Engineer
> > > > > >> > > > > > > > > > > > > > > > > > > > > > Amazon
> > > > > >> > > > > > > > > > > > > > > > > > > > > >

Re: [DISCUSS] KIP-852 Optimize calculation of size for log in remote tier

Posted by Satish Duggana <sa...@gmail.com>.
Hi Divij,
I am fine with having an API to compute the size as I mentioned in my
earlier reply in this mail thread. But I have the below comment for
the motivation for this KIP.

As you discussed offline, the main issue here is that listing calls for
remote log segment metadata are slower because of the storage used for
RLMM. These listing calls can be avoided with this new API.
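
The difference between the two approaches can be sketched roughly as
follows. This is only an illustration under stated assumptions: the types
below (SegmentMetadata, MetadataManager, InMemoryManager) are hypothetical,
simplified stand-ins and not the actual KIP-405 classes, and the method
name getRemoteLogSize simply follows the name used earlier in this thread.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified stand-ins for illustration only.
// These are NOT the real KIP-405 / KIP-852 classes or signatures.
final class RemoteLogSizeSketch {

    /** Simplified segment metadata: only the fields this sketch needs. */
    static final class SegmentMetadata {
        final long startOffset;
        final long sizeInBytes;

        SegmentMetadata(long startOffset, long sizeInBytes) {
            this.startOffset = startOffset;
            this.sizeInBytes = sizeInBytes;
        }
    }

    /** Simplified metadata manager exposing a listing call and a size call. */
    interface MetadataManager {
        Iterator<SegmentMetadata> listRemoteLogSegments();

        long getRemoteLogSize();
    }

    /** Full-scan approach: reads metadata of every segment on each call. */
    static long sizeByFullScan(MetadataManager rlmm) {
        long total = 0L;
        Iterator<SegmentMetadata> it = rlmm.listRemoteLogSegments();
        while (it.hasNext()) {
            total += it.next().sizeInBytes;
        }
        return total;
    }

    /** Assumed implementation that maintains a running total on add/delete. */
    static final class InMemoryManager implements MetadataManager {
        private final List<SegmentMetadata> segments = new ArrayList<>();
        private long totalBytes = 0L;

        synchronized void addSegment(SegmentMetadata segment) {
            segments.add(segment);
            totalBytes += segment.sizeInBytes;
        }

        /** Removes the segment that was added first, if any. */
        synchronized void deleteOldest() {
            if (!segments.isEmpty()) {
                totalBytes -= segments.remove(0).sizeInBytes;
            }
        }

        @Override
        public synchronized Iterator<SegmentMetadata> listRemoteLogSegments() {
            // Return a snapshot so callers can iterate without holding the lock.
            return new ArrayList<>(segments).iterator();
        }

        @Override
        public synchronized long getRemoteLogSize() {
            // Answered from the running total, without touching any segment.
            return totalBytes;
        }
    }
}
```

With a dedicated size call, an implementation backed by slower storage is
free to answer from an index or a running total, whereas sizeByFullScan
always costs one metadata read per segment.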

Please add this in the motivation section as it is one of the main
motivations for the KIP.

Thanks,
Satish.

On Sat, 1 Jul 2023 at 01:43, Jun Rao <ju...@confluent.io.invalid> wrote:
>
> Hi, Divij,
>
> Sorry for the late reply.
>
> Given your explanation, the new API sounds reasonable to me. Is that enough
> to build the external metadata layer for the remote segments or do you need
> some additional API changes?
>
> Thanks,
>
> Jun
>
> On Fri, Jun 9, 2023 at 7:08 AM Divij Vaidya <di...@gmail.com> wrote:
>
> > Thank you for looking into this Kamal.
> >
> > You are right in saying that a cold start (i.e. leadership failover or
> > broker startup) does not impact the broker startup duration. But it does
> > have the following impact:
> > 1. It leads to a burst of full-scan requests to RLMM in case multiple
> > leadership failovers occur at the same time. Even if the RLMM
> > implementation has the capability to serve the total size from an index
> > (and hence handle this burst), we wouldn't be able to use it since the
> > current API necessarily calls for a full scan.
> > 2. The archival (copying of data to tiered storage) process will have a
> > delayed start. The delayed start of archival could lead to a local build-up
> > of data, which may fill the disk.
> >
> > The disadvantage of adding this new API is that every provider will have to
> > implement it, agreed. But I believe that this tradeoff is worthwhile since
> > the default implementation could be the same as you mentioned, i.e. keeping
> > cumulative in-memory count.
> >
> > --
> > Divij Vaidya
> >
> >
> >
> > On Sun, Jun 4, 2023 at 5:48 PM Kamal Chandraprakash <
> > kamal.chandraprakash@gmail.com> wrote:
> >
> > > Hi Divij,
> > >
> > > Thanks for the KIP! Sorry for the late reply.
> > >
> > > Can you explain the rejected alternative-3?
> > > Store the cumulative size of remote tier log in-memory at
> > RemoteLogManager
> > > "*Cons*: Every time a broker starts-up, it will scan through all the
> > > segments in the remote tier to initialise the in-memory value. This would
> > > increase the broker start-up time."
> > >
> > > Keeping the source of truth to determine the remote-log-size in the
> > leader
> > > would be consistent across different implementations of the plugin. The
> > > concern posted in the KIP is that we are calculating the remote-log-size
> > on
> > > each iteration of the cleaner thread (say 5 mins). If we calculate only
> > > once during broker startup or during the leadership reassignment, do we
> > > still need the cache?
> > >
> > > The broker startup-time won't be affected by the remote log manager
> > > initialisation. The broker continues to accept new
> > > produce/fetch requests, while the RLM thread in the background can
> > > determine the remote-log-size once and start copying/deleting the
> > segments.
> > >
> > > Thanks,
> > > Kamal
> > >
> > > On Thu, Jun 1, 2023 at 2:08 PM Divij Vaidya <di...@gmail.com>
> > > wrote:
> > >
> > > > Satish / Jun
> > > >
> > > > Do you have any thoughts on this?
> > > >
> > > > --
> > > > Divij Vaidya
> > > >
> > > >
> > > >
> > > > On Tue, Feb 14, 2023 at 4:15 PM Divij Vaidya <di...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hey Jun
> > > > >
> > > > > It has been a while since this KIP got some attention. While we wait
> > > for
> > > > > Satish to chime in here, perhaps I can answer your question.
> > > > >
> > > > > > Could you explain how you exposed the log size in your KIP-405
> > > > > implementation?
> > > > >
> > > > > The APIs available in RLMM as per KIP405
> > > > > are, addRemoteLogSegmentMetadata(), updateRemoteLogSegmentMetadata(),
> > > > remoteLogSegmentMetadata(), highestOffsetForEpoch(),
> > > > putRemotePartitionDeleteMetadata(), listRemoteLogSegments(),
> > > > onPartitionLeadershipChanges()
> > > > > and onStopPartitions(). None of these APIs allow us to expose the log
> > > > size,
> > > > > hence, the only option that remains is to list all segments using
> > > > > listRemoteLogSegments() and aggregate them every time we require to
> > > > > calculate the size. Based on our prior discussion, this requires
> > > reading
> > > > > all segment metadata which won't work for non-local RLMM
> > > implementations.
> > > > > Satish's implementation also performs a full scan and calculates the
> > > > > aggregate. see:
> > > > >
> > > >
> > >
> > https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/remote/RemoteLogManager.scala#L619
> > > > >
> > > > >
> > > > > Does this answer your question?
> > > > >
> > > > > --
> > > > > Divij Vaidya
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Dec 20, 2022 at 8:40 PM Jun Rao <ju...@confluent.io.invalid>
> > > > wrote:
> > > > >
> > > > >> Hi, Divij,
> > > > >>
> > > > >> Thanks for the explanation.
> > > > >>
> > > > >> Good question.
> > > > >>
> > > > >> Hi, Satish,
> > > > >>
> > > > >> Could you explain how you exposed the log size in your KIP-405
> > > > >> implementation?
> > > > >>
> > > > >> Thanks,
> > > > >>
> > > > >> Jun
> > > > >>
> > > > >> On Tue, Dec 20, 2022 at 4:59 AM Divij Vaidya <
> > divijvaidya13@gmail.com
> > > >
> > > > >> wrote:
> > > > >>
> > > > >> > Hey Jun
> > > > >> >
> > > > >> > Yes, it is possible to maintain the log size in the cache (see
> > > > rejected
> > > > >> > alternative#3 in the KIP) but I did not understand how it is
> > > possible
> > > > to
> > > > >> > retrieve it without the new API. The log size could be calculated
> > on
> > > > >> > startup by scanning through the segments (though I would disagree
> > > that
> > > > >> this
> > > > > >> > is the right approach since scanning itself takes on the order of minutes
> > > > > >> > and hence delays the start of the archive process), and incrementally
> > > > maintained
> > > > >> > afterwards, even then, we would need an API in
> > > > RemoteLogMetadataManager
> > > > >> so
> > > > >> > that RLM could fetch the cached size!
> > > > >> >
> > > > >> > If we wish to cache the size without adding a new API, then we
> > need
> > > to
> > > > >> > cache the size in RLM itself (instead of RLMM implementation) and
> > > > >> > incrementally manage it. The downside of longer archive time at
> > > > startup
> > > > > >> > (due to the initial scan) still remains valid in this situation.
> > > > >> >
> > > > >> > --
> > > > >> > Divij Vaidya
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > On Fri, Dec 16, 2022 at 12:43 AM Jun Rao <jun@confluent.io.invalid
> > >
> > > > >> wrote:
> > > > >> >
> > > > >> > > Hi, Divij,
> > > > >> > >
> > > > >> > > Thanks for the explanation.
> > > > >> > >
> > > > >> > > If there is in-memory cache, could we maintain the log size in
> > the
> > > > >> cache
> > > > >> > > with the existing API? For example, a replica could make a
> > > > >> > > listRemoteLogSegments(TopicIdPartition topicIdPartition) call on
> > > > >> startup
> > > > >> > to
> > > > >> > > get the remote segment size before the current leaderEpoch. The
> > > > leader
> > > > >> > > could then maintain the size incrementally afterwards. On leader
> > > > >> change,
> > > > >> > > other replicas can make a listRemoteLogSegments(TopicIdPartition
> > > > >> > > topicIdPartition, int leaderEpoch) call to get the size of newly
> > > > >> > generated
> > > > >> > > segments.
> > > > >> > >
> > > > >> > > Thanks,
> > > > >> > >
> > > > >> > > Jun
> > > > >> > >
> > > > >> > >
> > > > >> > > On Wed, Dec 14, 2022 at 3:27 AM Divij Vaidya <
> > > > divijvaidya13@gmail.com
> > > > >> >
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > > Is the new method enough for doing size-based retention?
> > > > >> > > >
> > > > >> > > > Yes. You are right in assuming that this API only provides the
> > > > >> Remote
> > > > >> > > > storage size (for current epoch chain). We would use this API
> > > for
> > > > >> size
> > > > >> > > > based retention along with a value of localOnlyLogSegmentSize
> > > > which
> > > > >> is
> > > > >> > > > computed as Log.sizeInBytes(logSegments.filter(_.baseOffset >
> > > > >> > > > highestOffsetWithRemoteIndex)). Hence, (total_log_size =
> > > > >> > > > remoteLogSizeBytes + log.localOnlyLogSegmentSize). I have
> > > updated
> > > > >> the
> > > > >> > KIP
> > > > >> > > > with this information. You can also check an example
> > > > implementation
> > > > >> at
> > > > >> > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> > https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/Log.scala#L2077
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > > Do you imagine all accesses to remote metadata will be
> > across
> > > > the
> > > > >> > > network
> > > > >> > > > or will there be some local in-memory cache?
> > > > >> > > >
> > > > >> > > > I would expect a disk-less implementation to maintain a finite
> > > > >> > in-memory
> > > > >> > > > cache for segment metadata to optimize the number of network
> > > calls
> > > > >> made
> > > > >> > > to
> > > > >> > > > fetch the data. In future, we can think about bringing this
> > > finite
> > > > >> size
> > > > >> > > > cache into RLM itself but that's probably a conversation for a
> > > > >> > different
> > > > >> > > > KIP. There are many other things we would like to do to
> > optimize
> > > > the
> > > > >> > > Tiered
> > > > >> > > > storage interface such as introducing a circular buffer /
> > > > streaming
> > > > >> > > > interface from RSM (so that we don't have to wait to fetch the
> > > > >> entire
> > > > >> > > > segment before starting to send records to the consumer),
> > > caching
> > > > >> the
> > > > >> > > > segments fetched from RSM locally (I would assume all RSM
> > plugin
> > > > >> > > > implementations to do this, might as well add it to RLM) etc.
> > > > >> > > >
> > > > >> > > > --
> > > > >> > > > Divij Vaidya
> > > > >> > > >
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > On Mon, Dec 12, 2022 at 7:35 PM Jun Rao
> > > <jun@confluent.io.invalid
> > > > >
> > > > >> > > wrote:
> > > > >> > > >
> > > > >> > > > > Hi, Divij,
> > > > >> > > > >
> > > > >> > > > > Thanks for the reply.
> > > > >> > > > >
> > > > >> > > > > Is the new method enough for doing size-based retention? It
> > > > gives
> > > > >> the
> > > > >> > > > total
> > > > >> > > > > size of the remote segments, but it seems that we still
> > don't
> > > > know
> > > > >> > the
> > > > >> > > > > exact total size for a log since there could be overlapping
> > > > >> segments
> > > > >> > > > > between the remote and the local segments.
> > > > >> > > > >
> > > > >> > > > > You mentioned a disk-less implementation. Do you imagine all
> > > > >> accesses
> > > > >> > > to
> > > > >> > > > > remote metadata will be across the network or will there be
> > > some
> > > > >> > local
> > > > >> > > > > in-memory cache?
> > > > >> > > > >
> > > > >> > > > > Thanks,
> > > > >> > > > >
> > > > >> > > > > Jun
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > > On Wed, Dec 7, 2022 at 3:10 AM Divij Vaidya <
> > > > >> divijvaidya13@gmail.com
> > > > >> > >
> > > > >> > > > > wrote:
> > > > >> > > > >
> > > > >> > > > > > The method is needed for RLMM implementations which fetch
> > > the
> > > > >> > > > information
> > > > >> > > > > > over the network and not for the disk based
> > implementations
> > > > >> (such
> > > > >> > as
> > > > >> > > > the
> > > > >> > > > > > default topic based RLMM).
> > > > >> > > > > >
> > > > >> > > > > > I would argue that adding this API makes the interface
> > more
> > > > >> generic
> > > > >> > > > than
> > > > >> > > > > > what it is today. This is because, with the current APIs
> > an
> > > > >> > > implementor
> > > > >> > > > > is
> > > > >> > > > > > restricted to use disk based RLMM solutions only (i.e. the
> > > > >> default
> > > > >> > > > > > solution) whereas if we add this new API, we unblock usage
> > > of
> > > > >> > network
> > > > >> > > > > based
> > > > >> > > > > > RLMM implementations such as databases.
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > > > On Wed 30. Nov 2022 at 20:40, Jun Rao
> > > > <jun@confluent.io.invalid
> > > > >> >
> > > > >> > > > wrote:
> > > > >> > > > > >
> > > > >> > > > > > > Hi, Divij,
> > > > >> > > > > > >
> > > > >> > > > > > > Thanks for the reply.
> > > > >> > > > > > >
> > > > >> > > > > > > Point#2. My high level question is that is the new
> > method
> > > > >> needed
> > > > >> > > for
> > > > >> > > > > > every
> > > > >> > > > > > > implementation of remote storage or just for a specific
> > > > >> > > > implementation.
> > > > >> > > > > > The
> > > > >> > > > > > > issues that you pointed out exist for the default
> > > > >> implementation
> > > > >> > of
> > > > >> > > > > RLMM
> > > > >> > > > > > as
> > > > >> > > > > > > well and so far, the default implementation hasn't
> > found a
> > > > >> need
> > > > >> > > for a
> > > > >> > > > > > > similar new method. For public interface, ideally we
> > want
> > > to
> > > > >> make
> > > > >> > > it
> > > > >> > > > > more
> > > > >> > > > > > > general.
> > > > >> > > > > > >
> > > > >> > > > > > > Thanks,
> > > > >> > > > > > >
> > > > >> > > > > > > Jun
> > > > >> > > > > > >
> > > > >> > > > > > > On Mon, Nov 21, 2022 at 7:11 AM Divij Vaidya <
> > > > >> > > > divijvaidya13@gmail.com>
> > > > >> > > > > > > wrote:
> > > > >> > > > > > >
> > > > >> > > > > > > > Thank you Jun and Alex for your comments.
> > > > >> > > > > > > >
> > > > >> > > > > > > > Point#1: You are right Jun. As Alex mentioned, the
> > > > "derived
> > > > >> > > > metadata"
> > > > >> > > > > > can
> > > > >> > > > > > > > increase the size of cached metadata by a factor of 10
> > > but
> > > > >> it
> > > > >> > > > should
> > > > >> > > > > be
> > > > >> > > > > > > ok
> > > > >> > > > > > > > to cache just the actual metadata. My point about size
> > > > >> being a
> > > > >> > > > > > limitation
> > > > >> > > > > > > > for using cache is not valid anymore.
> > > > >> > > > > > > >
> > > > >> > > > > > > > Point#2: For a new replica, it would still have to
> > fetch
> > > > the
> > > > >> > > > metadata
> > > > >> > > > > > > over
> > > > >> > > > > > > > the network to initiate the warm up of the cache and
> > > > hence,
> > > > >> > > > increase
> > > > >> > > > > > the
> > > > >> > > > > > > > start time of the archival process. Please also note
> > the
> > > > >> > > > > repercussions
> > > > >> > > > > > of
> > > > >> > > > > > > > the warm up scan that Alex mentioned in this thread as
> > > > part
> > > > >> of
> > > > >> > > > > #102.2.
> > > > >> > > > > > > >
> > > > >> > > > > > > > 100#: Agreed Alex. Thanks for clarifying that. My
> > point
> > > > >> about
> > > > >> > > size
> > > > >> > > > > > being
> > > > >> > > > > > > a
> > > > >> > > > > > > > limitation for using cache is not valid anymore.
> > > > >> > > > > > > >
> > > > >> > > > > > > > 101#: Alex, if I understand correctly, you are
> > > suggesting
> > > > to
> > > > >> > > cache
> > > > >> > > > > the
> > > > >> > > > > > > > total size at the leader and update it on archival.
> > This
> > > > >> > wouldn't
> > > > >> > > > > work
> > > > >> > > > > > > for
> > > > >> > > > > > > > cases when the leader restarts where we would have to
> > > > make a
> > > > >> > full
> > > > >> > > > > scan
> > > > >> > > > > > > > to update the total size entry on startup. We expect
> > > users
> > > > >> to
> > > > >> > > store
> > > > >> > > > > > data
> > > > >> > > > > > > > over longer duration in remote storage which increases
> > > the
> > > > >> > > > likelihood
> > > > >> > > > > > of
> > > > >> > > > > > > > leader restarts / failovers.
> > > > >> > > > > > > >
> > > > >> > > > > > > > 102#.1: I don't think that the current design
> > > accommodates
> > > > >> the
> > > > >> > > fact
> > > > >> > > > > > that
> > > > >> > > > > > > > data corruption could happen at the RLMM plugin (we
> > > don't
> > > > >> have
> > > > >> > > > > checksum
> > > > >> > > > > > > as
> > > > >> > > > > > > > a field in metadata as part of KIP405). If data
> > > corruption
> > > > >> > > occurs,
> > > > >> > > > w/
> > > > >> > > > > > or
> > > > >> > > > > > > > w/o the cache, it would be a different problem to
> > > solve. I
> > > > >> > would
> > > > >> > > > like
> > > > >> > > > > > to
> > > > >> > > > > > > > keep this outside the scope of this KIP.
> > > > >> > > > > > > >
> > > > >> > > > > > > > 102#.2: Agree. This remains as the main concern for
> > > using
> > > > >> the
> > > > >> > > cache
> > > > >> > > > > to
> > > > >> > > > > > > > fetch total size.
> > > > >> > > > > > > >
> > > > >> > > > > > > > Regards,
> > > > >> > > > > > > > Divij Vaidya
> > > > >> > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > > > On Fri, Nov 18, 2022 at 12:59 PM Alexandre Dupriez <
> > > > >> > > > > > > > alexandre.dupriez@gmail.com> wrote:
> > > > >> > > > > > > >
> > > > >> > > > > > > > > Hi Divij,
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Thanks for the KIP. Please find some comments based
> > on
> > > > >> what I
> > > > >> > > > read
> > > > >> > > > > on
> > > > >> > > > > > > > > this thread so far - apologies for the repeats and
> > the
> > > > >> late
> > > > >> > > > reply.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > If I understand correctly, one of the main elements
> > of
> > > > >> > > discussion
> > > > >> > > > > is
> > > > >> > > > > > > > > about caching in Kafka versus delegation of
> > providing
> > > > the
> > > > >> > > remote
> > > > >> > > > > size
> > > > >> > > > > > > > > of a topic-partition to the plugin.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > A few comments:
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > 100. The size of the “derived metadata” which is
> > > managed
> > > > >> by
> > > > >> > the
> > > > >> > > > > > plugin
> > > > >> > > > > > > > > to represent an rlmMetadata can indeed be close to 1
> > > kB
> > > > on
> > > > >> > > > average
> > > > >> > > > > > > > > depending on its own internal structure, e.g. the
> > > > >> redundancy
> > > > >> > it
> > > > > >> > > > > > > > > enforces (unfortunately resulting in duplication),
> > > > >> additional
> > > > >> > > > > > > > > information such as checksums and primary and
> > > secondary
> > > > >> > > indexable
> > > > >> > > > > > > > > keys. But indeed, the rlmMetadata is itself a
> > lighter
> > > > data
> > > > >> > > > > structure
> > > > >> > > > > > > > > by a factor of 10. And indeed, instead of caching
> > the
> > > > >> > “derived
> > > > >> > > > > > > > > metadata”, only the rlmMetadata could be, which
> > should
> > > > >> > address
> > > > >> > > > the
> > > > >> > > > > > > > > concern regarding the memory occupancy of the cache.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > 101. I am not sure I fully understand why we would
> > > need
> > > > to
> > > > >> > > cache
> > > > >> > > > > the
> > > > >> > > > > > > > > list of rlmMetadata to retain the remote size of a
> > > > >> > > > topic-partition.
> > > > >> > > > > > > > > Since the leader of a topic-partition is, in
> > > > >> non-degenerated
> > > > >> > > > cases,
> > > > >> > > > > > > > > the only actor which can mutate the remote part of
> > the
> > > > >> > > > > > > > > topic-partition, hence its size, it could in theory
> > > only
> > > > >> > cache
> > > > >> > > > the
> > > > >> > > > > > > > > size of the remote log once it has calculated it? In
> > > > which
> > > > >> > case
> > > > >> > > > > there
> > > > >> > > > > > > > > would not be any problem regarding the size of the
> > > > caching
> > > > >> > > > > strategy.
> > > > >> > > > > > > > > Did I miss something there?
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > 102. There may be a few challenges to consider with
> > > > >> caching:
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > 102.1) As mentioned above, the caching strategy
> > > assumes
> > > > no
> > > > >> > > > mutation
> > > > >> > > > > > > > > outside the lifetime of a leader. While this is true
> > > in
> > > > >> the
> > > > >> > > > normal
> > > > >> > > > > > > > > course of operation, there could be accidental
> > > mutation
> > > > >> > outside
> > > > >> > > > of
> > > > >> > > > > > the
> > > > >> > > > > > > > > leader and a loss of consistency between the cached
> > > > state
> > > > >> and
> > > > >> > > the
> > > > >> > > > > > > > > actual remote representation of the log. E.g.
> > > > split-brain
> > > > >> > > > > scenarios,
> > > > >> > > > > > > > > bugs in the plugins, bugs in external systems with
> > > > >> mutating
> > > > >> > > > access
> > > > >> > > > > on
> > > > >> > > > > > > > > the derived metadata. In the worst case, a drift
> > > between
> > > > >> the
> > > > >> > > > cached
> > > > >> > > > > > > > > size and the actual size could lead to over-deleting
> > > > >> remote
> > > > >> > > data
> > > > >> > > > > > which
> > > > >> > > > > > > > > is a durability risk.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > The alternative you propose, by making the plugin
> > the
> > > > >> source
> > > > >> > of
> > > > >> > > > > truth
> > > > > >> > > > > > > > > w.r.t. the size of the remote log, can make it
> > > easier
> > > > >> to
> > > > >> > > avoid
> > > > >> > > > > > > > > inconsistencies between plugin-managed metadata and
> > > the
> > > > >> > remote
> > > > >> > > > log
> > > > >> > > > > > > > > from the perspective of Kafka. On the other hand,
> > > plugin
> > > > >> > > vendors
> > > > >> > > > > > would
> > > > >> > > > > > > > > have to implement it with the expected efficiency to
> > > > have
> > > > >> it
> > > > >> > > > yield
> > > > >> > > > > > > > > benefits.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > 102.2) As you mentioned, the caching strategy in
> > Kafka
> > > > >> would
> > > > >> > > > still
> > > > >> > > > > > > > > require one iteration over the list of rlmMetadata
> > > when
> > > > >> the
> > > > >> > > > > > leadership
> > > > >> > > > > > > > > of a topic-partition is assigned to a broker, while
> > > the
> > > > >> > plugin
> > > > >> > > > can
> > > > >> > > > > > > > > offer alternative constant-time approaches. This
> > > > >> calculation
> > > > >> > > > cannot
> > > > >> > > > > > be
> > > > >> > > > > > > > > put on the LeaderAndIsr path and would be performed
> > in
> > > > the
> > > > >> > > > > > background.
> > > > >> > > > > > > > > In case of bulk leadership migration, listing the
> > > > >> rlmMetadata
> > > > >> > > > could
> > > > >> > > > > > a)
> > > > >> > > > > > > > > result in request bursts to any backend system the
> > > > plugin
> > > > >> may
> > > > >> > > use
> > > > >> > > > > > > > > [which shouldn’t be a problem for high-throughput
> > data
> > > > >> stores
> > > > >> > > but
> > > > >> > > > > > > > > could have cost implications] b) increase
> > utilisation
> > > > >> > timespan
> > > > >> > > of
> > > > >> > > > > the
> > > > >> > > > > > > > > RLM threads for these calculations potentially
> > leading
> > > > to
> > > > >> > > > transient
> > > > >> > > > > > > > > starvation of tasks queued for, typically,
> > offloading
> > > > >> > > operations
> > > > >> > > > c)
> > > > >> > > > > > > > > could have a non-marginal CPU footprint on hardware
> > > with
> > > > >> > strict
> > > > >> > > > > > > > > resource constraints. All these elements could have
> > an
> > > > >> impact
> > > > >> > > to
> > > > >> > > > > some
> > > > >> > > > > > > > > degree depending on the operational environment.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > From a design perspective, one question is where we
> > > want
> > > > >> the
> > > > >> > > > source
> > > > >> > > > > > of
> > > > >> > > > > > > > > truth w.r.t. remote log size to be during the
> > lifetime
> > > > of
> > > > >> a
> > > > >> > > > leader.
> > > > >> > > > > > > > > The responsibility of maintaining a consistent
> > > > >> representation
> > > > >> > > of
> > > > >> > > > > the
> > > > >> > > > > > > > > remote log is shared by Kafka and the plugin. Which
> > > > >> system is
> > > > >> > > > best
> > > > >> > > > > > > > > placed to maintain such a state while providing the
> > > > >> highest
> > > > >> > > > > > > > > consistency guarantees is something both Kafka and
> > > > plugin
> > > > >> > > > designers
> > > > >> > > > > > > > > could help understand better.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Many thanks,
> > > > >> > > > > > > > > Alexandre
> > > > >> > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Le jeu. 17 nov. 2022 à 19:27, Jun Rao
> > > > >> > <jun@confluent.io.invalid
> > > > >> > > >
> > > > >> > > > a
> > > > >> > > > > > > > écrit :
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Hi, Divij,
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Thanks for the reply.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Point #1. Is the average remote segment metadata
> > > > really
> > > > >> > 1KB?
> > > > >> > > > > What's
> > > > >> > > > > > > > > listed
> > > > >> > > > > > > > > > in the public interface is probably well below 100
> > > > >> bytes.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Point #2. I guess you are assuming that each
> > broker
> > > > only
> > > > >> > > caches
> > > > >> > > > > the
> > > > >> > > > > > > > > remote
> > > > >> > > > > > > > > > segment metadata in memory. An alternative
> > approach
> > > is
> > > > >> to
> > > > >> > > cache
> > > > >> > > > > > them
> > > > >> > > > > > > in
> > > > >> > > > > > > > > > both memory and local disk. That way, on broker
> > > > restart,
> > > > >> > you
> > > > >> > > > just
> > > > >> > > > > > > need
> > > > >> > > > > > > > to
> > > > >> > > > > > > > > > fetch the new remote segments' metadata using the
> > > > >> > > > > > > > > > listRemoteLogSegments(TopicIdPartition
> > > > topicIdPartition,
> > > > >> > int
> > > > >> > > > > > > > leaderEpoch)
> > > > >> > > > > > > > > > api. Will that work?
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Point #3. Thanks for the explanation and it sounds
> > > > good.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Thanks,
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Jun
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > On Thu, Nov 17, 2022 at 7:31 AM Divij Vaidya <
> > > > >> > > > > > > divijvaidya13@gmail.com>
> > > > >> > > > > > > > > > wrote:
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > > Hi Jun
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > There are three points that I would like to
> > > present
> > > > >> here:
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > 1. We would require a large cache size to
> > > > efficiently
> > > > >> > cache
> > > > >> > > > all
> > > > >> > > > > > > > segment
> > > > >> > > > > > > > > > > metadata.
> > > > >> > > > > > > > > > > 2. Linear scan of all metadata at broker startup
> > > to
> > > > >> > > populate
> > > > >> > > > > the
> > > > >> > > > > > > > cache
> > > > >> > > > > > > > > will
> > > > >> > > > > > > > > > > be slow and will impact the archival process.
> > > > >> > > > > > > > > > > 3. There is no other use case where a full scan
> > of
> > > > >> > segment
> > > > >> > > > > > metadata
> > > > >> > > > > > > > is
> > > > >> > > > > > > > > > > required.
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > Let's start by quantifying 1. Here's my estimate
> > > for
> > > > >> the
> > > > >> > > size
> > > > >> > > > > of
> > > > >> > > > > > > the
> > > > >> > > > > > > > > cache.
> > > > >> > > > > > > > > > > Average size of segment metadata = 1KB. This
> > could
> > > > be
> > > > >> > more
> > > > >> > > if
> > > > >> > > > > we
> > > > >> > > > > > > have
> > > > >> > > > > > > > > > > frequent leader failover with a large number of
> > > > leader
> > > > >> > > epochs
> > > > >> > > > > > being
> > > > >> > > > > > > > > stored
> > > > >> > > > > > > > > > > per segment.
> > > > >> > > > > > > > > > > Segment size = 100MB. Users will prefer to
> > reduce
> > > > the
> > > > >> > > segment
> > > > >> > > > > > size
> > > > >> > > > > > > > > from the
> > > > >> > > > > > > > > > > default value of 1GB to ensure timely archival
> > of
> > > > data
> > > > >> > > since
> > > > >> > > > > data
> > > > >> > > > > > > > from
> > > > >> > > > > > > > > > > active segment is not archived.
> > > > >> > > > > > > > > > > Cache size = num segments * avg. segment
> > metadata
> > > > >> size =
> > > > >> > > > > > > > > (100TB/100MB)*1KB
> > > > >> > > > > > > > > > > = 1GB.
> > > > >> > > > > > > > > > > While 1GB for cache may not sound like a large
> > > > number
> > > > >> for
> > > > >> > > > > larger
> > > > >> > > > > > > > > machines,
> > > > >> > > > > > > > > > > it does eat into the memory as an additional
> > cache
> > > > and
> > > > >> > > makes
> > > > >> > > > > use
> > > > >> > > > > > > > cases
> > > > >> > > > > > > > > with
> > > > > >> > > > > > > > > > > large data retention with low throughput
> > expensive
> > > > >> (where
> > > > >> > > > such
> > > > >> > > > > > use
> > > > >> > > > > > > > case
> > > > > >> > > > > > > > > > > could otherwise use smaller machines).
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > About point#2:
> > > > >> > > > > > > > > > > Even if we say that all segment metadata can fit
> > > > into
> > > > >> the
> > > > >> > > > > cache,
> > > > >> > > > > > we
> > > > >> > > > > > > > > will
> > > > >> > > > > > > > > > > need to populate the cache on broker startup. It
> > > > would
> > > > >> > not
> > > > >> > > be
> > > > >> > > > > in
> > > > >> > > > > > > the
> > > > > >> > > > > > > > > > > critical path of broker startup and hence won't
> > > > >> impact
> > > > >> > the
> > > > >> > > > > > startup
> > > > >> > > > > > > > > time.
> > > > >> > > > > > > > > > > But it will impact the time when we could start
> > > the
> > > > >> > > archival
> > > > >> > > > > > > process
> > > > >> > > > > > > > > since
> > > > >> > > > > > > > > > > the RLM thread pool will be blocked on the first
> > > > call
> > > > >> to
> > > > >> > > > > > > > > > > listRemoteLogSegments(). To scan metadata for
> > 1MM
> > > > >> > segments
> > > > >> > > > > > > (computed
> > > > >> > > > > > > > > above)
> > > > >> > > > > > > > > > > and transfer 1GB data over the network from a
> > RLMM
> > > > >> such
> > > > >> > as
> > > > >> > > a
> > > > >> > > > > > remote
> > > > >> > > > > > > > > > > database would be in the order of minutes
> > > (depending
> > > > >> on
> > > > >> > how
> > > > >> > > > > > > efficient
> > > > >> > > > > > > > > the
> > > > >> > > > > > > > > > > scan is with the RLMM implementation).
> > Although, I
> > > > >> would
> > > > >> > > > > concede
> > > > >> > > > > > > that
> > > > >> > > > > > > > > > > having RLM threads blocked for a few minutes is
> > > > >> perhaps
> > > > >> > OK
> > > > >> > > > but
> > > > >> > > > > if
> > > > >> > > > > > > we
> > > > >> > > > > > > > > > > introduce the new API proposed in the KIP, we
> > > would
> > > > >> have
> > > > >> > a
> > > > >> > > > > > > > > > > deterministic startup time for RLM. Adding the
> > API
> > > > >> comes
> > > > >> > > at a
> > > > >> > > > > low
> > > > >> > > > > > > > cost
> > > > >> > > > > > > > > and
> > > > >> > > > > > > > > > > I believe the trade off is worth it.
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > About point#3:
> > > > >> > > > > > > > > > > We can use
> > listRemoteLogSegments(TopicIdPartition
> > > > >> > > > > > topicIdPartition,
> > > > >> > > > > > > > int
> > > > >> > > > > > > > > > > leaderEpoch) to calculate the segments eligible
> > > for
> > > > >> > > deletion
> > > > >> > > > > > (based
> > > > >> > > > > > > > on
> > > > >> > > > > > > > > size
> > > > >> > > > > > > > > > > retention) where leader epoch(s) belong to the
> > > > current
> > > > >> > > leader
> > > > >> > > > > > epoch
> > > > >> > > > > > > > > chain.
> > > > >> > > > > > > > > > > I understand that it may lead to segments
> > > belonging
> > > > to
> > > > >> > > other
> > > > >> > > > > > epoch
> > > > >> > > > > > > > > lineage
> > > > >> > > > > > > > > > > not getting deleted and would require a separate
> > > > >> > mechanism
> > > > >> > > to
> > > > >> > > > > > > delete
> > > > >> > > > > > > > > them.
> > > > >> > > > > > > > > > > The separate mechanism would be required anyway to
> > > > >> > > > > > > > > > > delete these "leaked" segments, as there are other cases
> > > > >> > > > > > > > > > > which could lead to leaks, such as network problems with
> > > > >> > > > > > > > > > > RSM midway through writing a segment.
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > Thank you for the replies so far. They have made
> > > me
> > > > >> > > re-think
> > > > >> > > > my
> > > > >> > > > > > > > > assumptions
> > > > >> > > > > > > > > > > and this dialogue has been very constructive for
> > > me.
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > Regards,
> > > > >> > > > > > > > > > > Divij Vaidya
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > On Thu, Nov 10, 2022 at 10:49 PM Jun Rao
> > > > >> > > > > > <jun@confluent.io.invalid
> > > > >> > > > > > > >
> > > > >> > > > > > > > > wrote:
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > > Hi, Divij,
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > Thanks for the reply.
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > It's true that the data in Kafka could be kept
> > > > >> longer
> > > > >> > > with
> > > > >> > > > > > > KIP-405.
> > > > >> > > > > > > > > How
> > > > >> > > > > > > > > > > > much data do you envision to have per broker?
> > > For
> > > > >> 100TB
> > > > >> > > > data
> > > > >> > > > > > per
> > > > >> > > > > > > > > broker,
> > > > >> > > > > > > > > > > > with 1GB segment and segment metadata of 100
> > > > bytes,
> > > > >> it
> > > > >> > > > > requires
> > > > >> > > > > > > > > > > > 100TB/1GB*100 = 10MB, which should fit in
> > > memory.
> > > > >> > > > > > > > > > > >
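The estimate above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes); the variable names are illustrative only and do not come from Kafka.

```python
# Back-of-the-envelope check of the metadata footprint per broker.
# Assumption: decimal units; the figures are the ones quoted in the email.
TB = 10**12
GB = 10**9

data_per_broker_bytes = 100 * TB      # 100 TB of tiered data per broker
segment_size_bytes = 1 * GB           # 1 GB log segments
metadata_per_segment_bytes = 100      # ~100 bytes of metadata per segment

segment_count = data_per_broker_bytes // segment_size_bytes
metadata_bytes = segment_count * metadata_per_segment_bytes

print(f"{segment_count} segments, {metadata_bytes / 10**6:.0f} MB of metadata")
```

With smaller segments or larger per-segment metadata the footprint grows linearly, which is the scaling concern raised elsewhere in this thread.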
> > > > >> > > > > > > > > > > > RemoteLogMetadataManager has two
> > > > >> > listRemoteLogSegments()
> > > > >> > > > > > methods.
> > > > >> > > > > > > > > The one
> > > > >> > > > > > > > > > > > you listed
> > > listRemoteLogSegments(TopicIdPartition
> > > > >> > > > > > > topicIdPartition,
> > > > >> > > > > > > > > int
> > > > >> > > > > > > > > > > > leaderEpoch) does return data in offset order.
> > > > >> However,
> > > > >> > > the
> > > > >> > > > > > other
> > > > >> > > > > > > > > > > > one listRemoteLogSegments(TopicIdPartition
> > > > >> > > > topicIdPartition)
> > > > >> > > > > > > > doesn't
> > > > >> > > > > > > > > > > > specify the return order. I assume that you
> > need
> > > > the
> > > > >> > > latter
> > > > >> > > > > to
> > > > >> > > > > > > > > calculate
> > > > >> > > > > > > > > > > > the segment size?
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > Thanks,
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > Jun
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > On Thu, Nov 10, 2022 at 10:25 AM Divij Vaidya
> > <
> > > > >> > > > > > > > > divijvaidya13@gmail.com>
> > > > >> > > > > > > > > > > > wrote:
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > *Jun,*
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > *"the default implementation of RLMM does
> > > local
> > > > >> > > caching,
> > > > >> > > > > > > right?"*
> > > > >> > > > > > > > > > > > > Yes, Jun. The default implementation of RLMM
> > > > does
> > > > >> > > indeed
> > > > >> > > > > > cache
> > > > >> > > > > > > > the
> > > > >> > > > > > > > > > > > segment
> > > > >> > > > > > > > > > > > > metadata today, hence, it won't work for use
> > > > cases
> > > > >> > when
> > > > >> > > > the
> > > > >> > > > > > > > number
> > > > >> > > > > > > > > of
> > > > >> > > > > > > > > > > > > segments in remote storage is large enough
> > to
> > > > >> exceed
> > > > >> > > the
> > > > >> > > > > size
> > > > >> > > > > > > of
> > > > >> > > > > > > > > cache.
> > > > >> > > > > > > > > > > > As
> > > > >> > > > > > > > > > > > > part of this KIP, I will implement the new
> > > > >> proposed
> > > > >> > API
> > > > >> > > > in
> > > > >> > > > > > the
> > > > >> > > > > > > > > default
> > > > >> > > > > > > > > > > > > implementation of RLMM but the underlying
> > > > >> > > implementation
> > > > >> > > > > will
> > > > >> > > > > > > > > still be
> > > > >> > > > > > > > > > > a
> > > > >> > > > > > > > > > > > > scan. I will pick up optimizing that in a
> > > > separate
> > > > >> > PR.
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > *"we also cache all segment metadata in the
> > > > >> brokers
> > > > >> > > > without
> > > > >> > > > > > > > > KIP-405. Do
> > > > >> > > > > > > > > > > > you
> > > > >> > > > > > > > > > > > > see a need to change that?"*
> > > > >> > > > > > > > > > > > > Please correct me if I am wrong here but we
> > > > cache
> > > > >> > > > metadata
> > > > >> > > > > > for
> > > > >> > > > > > > > > segments
> > > > >> > > > > > > > > > > > > "residing in local storage". The size of the
> > > > >> current
> > > > >> > > > cache
> > > > >> > > > > > > works
> > > > >> > > > > > > > > fine
> > > > >> > > > > > > > > > > for
> > > > >> > > > > > > > > > > > > the scale of the number of segments that we
> > > > >> expect to
> > > > >> > > > store
> > > > >> > > > > > in
> > > > >> > > > > > > > > local
> > > > >> > > > > > > > > > > > > storage. After KIP-405, that cache will
> > > continue
> > > > >> to
> > > > >> > > store
> > > > >> > > > > > > > metadata
> > > > >> > > > > > > > > for
> > > > >> > > > > > > > > > > > > segments which are residing in local storage
> > > and
> > > > >> > hence,
> > > > >> > > > we
> > > > >> > > > > > > don't
> > > > >> > > > > > > > > need
> > > > >> > > > > > > > > > > to
> > > > >> > > > > > > > > > > > > change that. For segments which have been
> > > > >> offloaded
> > > > >> > to
> > > > >> > > > > remote
> > > > >> > > > > > > > > storage,
> > > > >> > > > > > > > > > > it
> > > > >> > > > > > > > > > > > > would rely on RLMM. Note that the scale of
> > > data
> > > > >> > stored
> > > > >> > > in
> > > > >> > > > > > RLMM
> > > > >> > > > > > > is
> > > > >> > > > > > > > > > > > different
> > > > >> > > > > > > > > > > > > from local cache because the number of
> > > segments
> > > > is
> > > > >> > > > expected
> > > > >> > > > > > to
> > > > >> > > > > > > be
> > > > >> > > > > > > > > much
> > > > >> > > > > > > > > > > > > larger than what current implementation
> > stores
> > > > in
> > > > >> > local
> > > > >> > > > > > > storage.
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > 2,3,4:
> > > > >> > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > >> > > > > does
> > > > >> > > > > > > > > specify
> > > > >> > > > > > > > > > > the
> > > > >> > > > > > > > > > > > > order i.e. it returns the segments sorted by
> > > > first
> > > > >> > > offset
> > > > >> > > > > in
> > > > >> > > > > > > > > ascending
> > > > >> > > > > > > > > > > > > order. I am copying the API docs for KIP-405
> > > > here
> > > > >> for
> > > > >> > > > your
> > > > >> > > > > > > > > reference
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > *Returns iterator of remote log segment metadata, sorted by
> > > > >> > > > > > > > > > > > > {@link RemoteLogSegmentMetadata#startOffset()} in ascending order which
> > > > >> > > > > > > > > > > > > contains the given leader epoch. This is used by remote log retention
> > > > >> > > > > > > > > > > > > management subsystem to fetch the segment metadata for a given leader
> > > > >> > > > > > > > > > > > > epoch.
> > > > >> > > > > > > > > > > > > @param topicIdPartition topic partition
> > > > >> > > > > > > > > > > > > @param leaderEpoch leader epoch
> > > > >> > > > > > > > > > > > > @return Iterator of remote segments, sorted by start offset in ascending
> > > > >> > > > > > > > > > > > > order.*
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > *Luke,*
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > 5. Note that we are trying to optimize the
> > > > >> efficiency
> > > > >> > > of
> > > > >> > > > > size
> > > > >> > > > > > > > based
> > > > >> > > > > > > > > > > > > retention for remote storage. KIP-405 does not introduce a new config,
> > > > >> > > > > > > > > > > > > similar to log.retention.check.interval.ms, for periodically checking
> > > > >> > > > > > > > > > > > > retention in remote storage.
> > Hence,
> > > > the
> > > > >> > > metric
> > > > >> > > > > > will
> > > > >> > > > > > > be
> > > > >> > > > > > > > > > > updated
> > > > >> > > > > > > > > > > > > at the time of invoking log retention check
> > > for
> > > > >> > remote
> > > > >> > > > tier
> > > > >> > > > > > > which
> > > > >> > > > > > > > > is
> > > > >> > > > > > > > > > > > > pending implementation today. We can perhaps
> > > > come
> > > > >> > back
> > > > >> > > > and
> > > > >> > > > > > > update
> > > > >> > > > > > > > > the
> > > > >> > > > > > > > > > > > > metric description after the implementation
> > of
> > > > log
> > > > >> > > > > retention
> > > > >> > > > > > > > check
> > > > >> > > > > > > > > in
> > > > >> > > > > > > > > > > > > RemoteLogManager.
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > --
> > > > >> > > > > > > > > > > > > Divij Vaidya
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > On Thu, Nov 10, 2022 at 6:16 AM Luke Chen <
> > > > >> > > > > showuon@gmail.com
> > > > >> > > > > > >
> > > > >> > > > > > > > > wrote:
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > Hi Divij,
> > > > >> > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > One more question about the metric:
> > > > >> > > > > > > > > > > > > > I think the metric will be updated when
> > > > >> > > > > > > > > > > > > > (1) each time we run the log retention
> > check
> > > > >> (that
> > > > >> > > is,
> > > > >> > > > > > > > > > > > > > log.retention.check.interval.ms)
> > > > >> > > > > > > > > > > > > > (2) when the user explicitly calls getRemoteLogSize
> > > > >> > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > Is that correct?
> > > > >> > > > > > > > > > > > > > Maybe we should add a note in the metric description; otherwise, users
> > > > >> > > > > > > > > > > > > > may be surprised when they see, say, 0 for RemoteLogSizeBytes.
> > > > >> > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > Otherwise, LGTM
> > > > >> > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > Thank you for the KIP
> > > > >> > > > > > > > > > > > > > Luke
> > > > >> > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > On Thu, Nov 10, 2022 at 2:55 AM Jun Rao
> > > > >> > > > > > > > <jun@confluent.io.invalid
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > > > > wrote:
> > > > >> > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > Hi, Divij,
> > > > >> > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > Thanks for the explanation.
> > > > >> > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > 1. Hmm, the default implementation of
> > RLMM
> > > > >> does
> > > > >> > > local
> > > > >> > > > > > > > caching,
> > > > >> > > > > > > > > > > right?
> > > > >> > > > > > > > > > > > > > > Currently, we also cache all segment
> > > > metadata
> > > > >> in
> > > > >> > > the
> > > > >> > > > > > > brokers
> > > > >> > > > > > > > > > > without
> > > > >> > > > > > > > > > > > > > > KIP-405. Do you see a need to change
> > that?
> > > > >> > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > 2,3,4: Yes, your explanation makes
> > sense.
> > > > >> > However,
> > > > >> > > > > > > > > > > > > > > currently,
> > > > >> > > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > >> > > > > > > > > doesn't
> > > > >> > > > > > > > > > > > > > specify
> > > > >> > > > > > > > > > > > > > > a particular order of the iterator. Do
> > you
> > > > >> intend
> > > > >> > > to
> > > > >> > > > > > change
> > > > >> > > > > > > > > that?
> > > > >> > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > Thanks,
> > > > >> > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > Jun
> > > > >> > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 3:31 AM Divij
> > > Vaidya
> > > > <
> > > > >> > > > > > > > > > > divijvaidya13@gmail.com
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > wrote:
> > > > >> > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > Hey Jun
> > > > >> > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > Thank you for your comments.
> > > > >> > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > *1. "RLMM implementor could ensure
> > that
> > > > >> > > > > > > > > listRemoteLogSegments()
> > > > >> > > > > > > > > > > is
> > > > >> > > > > > > > > > > > > > fast"*
> > > > >> > > > > > > > > > > > > > > > This would be ideal but pragmatically,
> > > it
> > > > is
> > > > >> > > > > difficult
> > > > >> > > > > > to
> > > > >> > > > > > > > > ensure
> > > > >> > > > > > > > > > > > that
> > > > >> > > > > > > > > > > > > > > > listRemoteLogSegments() is fast. This is because the possibility of a
> > > > >> > > > > > > > > > > > > > > > large number of segments (much larger
> > > than
> > > > >> what
> > > > >> > > > Kafka
> > > > >> > > > > > > > > currently
> > > > >> > > > > > > > > > > > > handles
> > > > >> > > > > > > > > > > > > > > > with local storage today) would make
> > it
> > > > >> > > infeasible
> > > > >> > > > to
> > > > >> > > > > > > adopt
> > > > >> > > > > > > > > > > > > strategies
> > > > >> > > > > > > > > > > > > > > such
> > > > >> > > > > > > > > > > > > > > > as local caching to improve the
> > > > performance
> > > > >> of
> > > > >> > > > > > > > > > > > listRemoteLogSegments.
> > > > >> > > > > > > > > > > > > > > Apart
> > > > >> > > > > > > > > > > > > > > > from caching (which won't work due to
> > > size
> > > > >> > > > > > limitations) I
> > > > >> > > > > > > > > can't
> > > > >> > > > > > > > > > > > think
> > > > >> > > > > > > > > > > > > > of
> > > > >> > > > > > > > > > > > > > > > other strategies which may eliminate
> > the
> > > > >> need
> > > > >> > for
> > > > >> > > > IO
> > > > >> > > > > > > > > > > > > > > > operations proportional to the number
> > of
> > > > >> total
> > > > >> > > > > > segments.
> > > > >> > > > > > > > > Please
> > > > >> > > > > > > > > > > > > advise
> > > > >> > > > > > > > > > > > > > if
> > > > >> > > > > > > > > > > > > > > > you have something in mind.
> > > > >> > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > 2.  "*If the size exceeds the
> > retention
> > > > >> size,
> > > > >> > we
> > > > >> > > > need
> > > > >> > > > > > to
> > > > >> > > > > > > > > > > determine
> > > > >> > > > > > > > > > > > > the
> > > > >> > > > > > > > > > > > > > > > subset of segments to delete to bring
> > > the
> > > > >> size
> > > > >> > > > within
> > > > >> > > > > > the
> > > > >> > > > > > > > > > > retention
> > > > >> > > > > > > > > > > > > > > limit.
> > > > >> > > > > > > > > > > > > > > > Do we need to call
> > > > >> > > > > > > > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > >> > > > > > > > > > > > > to
> > > > >> > > > > > > > > > > > > > > > determine that?"*
> > > > >> > > > > > > > > > > > > > > > Yes, we need to call
> > > > >> listRemoteLogSegments() to
> > > > >> > > > > > determine
> > > > >> > > > > > > > > which
> > > > >> > > > > > > > > > > > > > segments
> > > > >> > > > > > > > > > > > > > > > should be deleted. But there is a
> > > > difference
> > > > >> > with
> > > > >> > > > the
> > > > >> > > > > > use
> > > > >> > > > > > > > > case we
> > > > >> > > > > > > > > > > > are
> > > > >> > > > > > > > > > > > > > > > trying to optimize with this KIP. To
> > > > >> determine
> > > > >> > > the
> > > > >> > > > > > subset
> > > > >> > > > > > > > of
> > > > >> > > > > > > > > > > > segments
> > > > >> > > > > > > > > > > > > > > which
> > > > >> > > > > > > > > > > > > > > > would be deleted, we only read
> > metadata
> > > > for
> > > > >> > > > segments
> > > > >> > > > > > > which
> > > > >> > > > > > > > > would
> > > > >> > > > > > > > > > > be
> > > > >> > > > > > > > > > > > > > > deleted
> > > > >> > > > > > > > > > > > > > > > via the listRemoteLogSegments(). But
> > to
> > > > >> > determine
> > > > >> > > > the
> > > > >> > > > > > > > > > > totalLogSize,
> > > > >> > > > > > > > > > > > > > which
> > > > >> > > > > > > > > > > > > > > > is required every time retention logic
> > > > >> based on
> > > > >> > > > size
> > > > >> > > > > > > > > executes, we
> > > > >> > > > > > > > > > > > > read
> > > > >> > > > > > > > > > > > > > > > metadata of *all* the segments in
> > remote
> > > > >> > storage.
> > > > >> > > > > > Hence,
> > > > >> > > > > > > > the
> > > > >> > > > > > > > > > > number
> > > > >> > > > > > > > > > > > > of
> > > > >> > > > > > > > > > > > > > > > results returned by
> > > > >> > > > > > > > > > > >
> > > *RemoteLogMetadataManager.listRemoteLogSegments()
> > > > >> > > > > > > > > > > > > > *is
> > > > >> > > > > > > > > > > > > > > > different when we are calculating
> > > > >> totalLogSize
> > > > >> > > vs.
> > > > >> > > > > when
> > > > >> > > > > > > we
> > > > >> > > > > > > > > are
> > > > >> > > > > > > > > > > > > > > determining
> > > > >> > > > > > > > > > > > > > > > the subset of segments to delete.
> > > > >> > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > 3.
> > > > >> > > > > > > > > > > > > > > > *"Also, what about time-based
> > retention?
> > > > To
> > > > >> > make
> > > > >> > > > that
> > > > >> > > > > > > > > efficient,
> > > > >> > > > > > > > > > > do
> > > > >> > > > > > > > > > > > > we
> > > > >> > > > > > > > > > > > > > > need
> > > > >> > > > > > > > > > > > > > > > to make some additional interface changes?"*
> > > > >> > > > > > > > > > > > > > > > No. Note that the time complexity
> > > > >> > > > > > > > > > > > > > > > to determine the segments for
> > retention
> > > is
> > > > >> > > > different
> > > > >> > > > > > for
> > > > >> > > > > > > > time
> > > > >> > > > > > > > > > > based
> > > > >> > > > > > > > > > > > > vs.
> > > > >> > > > > > > > > > > > > > > > size based. For time based, the time
> > > > >> complexity
> > > > >> > > is
> > > > >> > > > a
> > > > >> > > > > > > > > function of
> > > > >> > > > > > > > > > > > the
> > > > >> > > > > > > > > > > > > > > number
> > > > >> > > > > > > > > > > > > > > > of segments which are "eligible for
> > > > >> deletion"
> > > > >> > > > (since
> > > > >> > > > > we
> > > > >> > > > > > > > only
> > > > >> > > > > > > > > read
> > > > >> > > > > > > > > > > > > > > metadata
> > > > >> > > > > > > > > > > > > > > > for segments which would be deleted)
> > > > >> whereas in
> > > > >> > > > size
> > > > >> > > > > > > based
> > > > >> > > > > > > > > > > > retention,
> > > > >> > > > > > > > > > > > > > the
> > > > >> > > > > > > > > > > > > > > > time complexity is a function of "all
> > > > >> segments"
> > > > >> > > > > > available
> > > > >> > > > > > > > in
> > > > >> > > > > > > > > > > remote
> > > > >> > > > > > > > > > > > > > > storage
> > > > >> > > > > > > > > > > > > > > > (metadata of all segments needs to be
> > > read
> > > > >> to
> > > > >> > > > > calculate
> > > > >> > > > > > > the
> > > > >> > > > > > > > > total
> > > > >> > > > > > > > > > > > > > size).
> > > > >> > > > > > > > > > > > > > > As
> > > > >> > > > > > > > > > > > > > > > you may observe, this KIP will bring
> > the
> > > > >> time
> > > > >> > > > > > complexity
> > > > >> > > > > > > > for
> > > > >> > > > > > > > > both
> > > > >> > > > > > > > > > > > > time
> > > > >> > > > > > > > > > > > > > > > based retention & size based retention
> > > to
> > > > >> the
> > > > >> > > same
> > > > >> > > > > > > > function.
> > > > >> > > > > > > > > > > > > > > >
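The difference described in points 2 and 3 can be sketched as follows. This is an illustrative simplification, not Kafka's retention code: `SegmentMetadata`, `total_size_by_scan`, `segments_to_delete` and `counting` are hypothetical names, and the deletion rule (drop the oldest segments until the remaining size is at or below the limit) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class SegmentMetadata:
    """Hypothetical stand-in for remote segment metadata."""
    start_offset: int
    size_bytes: int


def total_size_by_scan(segments: Iterable[SegmentMetadata]) -> int:
    # Without a dedicated size API, every segment's metadata must be read.
    return sum(s.size_bytes for s in segments)


def segments_to_delete(
    segments_in_offset_order: Iterable[SegmentMetadata],
    total_size: int,
    retention_bytes: int,
) -> List[SegmentMetadata]:
    # Walk segments oldest-first and stop as soon as the limit is met, so
    # only the deleted segments (plus one more) are read from the iterator.
    to_delete: List[SegmentMetadata] = []
    remaining = total_size
    for seg in segments_in_offset_order:
        if remaining <= retention_bytes:
            break
        to_delete.append(seg)
        remaining -= seg.size_bytes
    return to_delete


def counting(segments: Iterable[SegmentMetadata], reads: List[int]) -> Iterator[SegmentMetadata]:
    # Helper that counts how many metadata entries were actually consumed.
    for seg in segments:
        reads[0] += 1
        yield seg


# Ten segments of 10 bytes each, with a 75-byte retention limit.
segs = [SegmentMetadata(start_offset=i * 10, size_bytes=10) for i in range(10)]
scan_reads, delete_reads = [0], [0]
total = total_size_by_scan(counting(segs, scan_reads))
doomed = segments_to_delete(counting(segs, delete_reads), total, 75)
```

In this example the size calculation consumes all ten entries, while selecting the segments to delete consumes only a handful. If the total came from a constant-time call instead, size-based retention would read metadata only in proportion to the segments it deletes, as time-based retention does.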
> > > > >> > > > > > > > > > > > > > > > 4. Also, please note that this new API
> > > > >> > introduced
> > > > >> > > > in
> > > > >> > > > > > this
> > > > >> > > > > > > > KIP
> > > > >> > > > > > > > > > > also
> > > > >> > > > > > > > > > > > > > > enables
> > > > >> > > > > > > > > > > > > > > > us to provide a metric for total size
> > of
> > > > >> data
> > > > >> > > > stored
> > > > >> > > > > in
> > > > >> > > > > > > > > remote
> > > > >> > > > > > > > > > > > > storage.
> > > > >> > > > > > > > > > > > > > > > Without the API, calculation of this
> > > > metric
> > > > >> > will
> > > > >> > > > > become
> > > > >> > > > > > > > very
> > > > >> > > > > > > > > > > > > expensive
> > > > >> > > > > > > > > > > > > > > with
> > > > >> > > > > > > > > > > > > > > > *listRemoteLogSegments().*
> > > > >> > > > > > > > > > > > > > > > I understand that your motivation here
> > > is
> > > > to
> > > > >> > > avoid
> > > > >> > > > > > > > polluting
> > > > >> > > > > > > > > the
> > > > >> > > > > > > > > > > > > > > interface
> > > > >> > > > > > > > > > > > > > > > with optimization specific APIs and I
> > > will
> > > > >> > agree
> > > > >> > > > with
> > > > >> > > > > > > that
> > > > >> > > > > > > > > goal.
> > > > >> > > > > > > > > > > > But
> > > > >> > > > > > > > > > > > > I
> > > > >> > > > > > > > > > > > > > > > believe that this new API proposed in
> > > the
> > > > >> KIP
> > > > >> > > > brings
> > > > >> > > > > in
> > > > >> > > > > > > > > > > significant
> > > > >> > > > > > > > > > > > > > > > improvement and there is no other workaround available
> > > > >> > > > > > > to
> > > > >> > > > > > > > > > > achieve
> > > > >> > > > > > > > > > > > > the
> > > > >> > > > > > > > > > > > > > > same
> > > > >> > > > > > > > > > > > > > > > performance.
> > > > >> > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > Regards,
> > > > >> > > > > > > > > > > > > > > > Divij Vaidya
> > > > >> > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 12:12 AM Jun
> > Rao
> > > > >> > > > > > > > > <jun@confluent.io.invalid
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > wrote:
> > > > >> > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > Hi, Divij,
> > > > >> > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > Thanks for the KIP. Sorry for the
> > late
> > > > >> reply.
> > > > >> > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > The motivation of the KIP is to
> > > improve
> > > > >> the
> > > > >> > > > > > efficiency
> > > > >> > > > > > > of
> > > > >> > > > > > > > > size
> > > > >> > > > > > > > > > > > > based
> > > > >> > > > > > > > > > > > > > > > > retention. I am not sure the
> > proposed
> > > > >> changes
> > > > >> > > are
> > > > >> > > > > > > enough.
> > > > >> > > > > > > > > For
> > > > >> > > > > > > > > > > > > > example,
> > > > >> > > > > > > > > > > > > > > if
> > > > >> > > > > > > > > > > > > > > > > the size exceeds the retention size,
> > > we
> > > > >> need
> > > > >> > to
> > > > >> > > > > > > determine
> > > > >> > > > > > > > > the
> > > > >> > > > > > > > > > > > > subset
> > > > >> > > > > > > > > > > > > > of
> > > > >> > > > > > > > > > > > > > > > > segments to delete to bring the size
> > > > >> within
> > > > >> > the
> > > > >> > > > > > > retention
> > > > >> > > > > > > > > > > limit.
> > > > >> > > > > > > > > > > > Do
> > > > >> > > > > > > > > > > > > > we
> > > > >> > > > > > > > > > > > > > > > need
> > > > >> > > > > > > > > > > > > > > > > to call
> > > > >> > > > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > >> > > > > > > > to
> > > > >> > > > > > > > > > > > > determine
> > > > >> > > > > > > > > > > > > > > > that?
> > > > >> > > > > > > > > > > > > > > > > Also, what about time-based
> > retention?
> > > > To
> > > > >> > make
> > > > >> > > > that
> > > > >> > > > > > > > > efficient,
> > > > >> > > > > > > > > > > do
> > > > >> > > > > > > > > > > > > we
> > > > >> > > > > > > > > > > > > > > need
> > > > >> > > > > > > > > > > > > > > > > to make some additional interface
> > > > changes?
> > > > >> > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > An alternative approach is for the
> > > RLMM
> > > > >> > > > implementor
> > > > >> > > > > > to
> > > > >> > > > > > > > make
> > > > >> > > > > > > > > > > sure
> > > > >> > > > > > > > > > > > > > > > > that
> > > > >> > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > >> > > > > > > is
> > > > >> > > > > > > > > fast
> > > > >> > > > > > > > > > > > > (e.g.,
> > > > >> > > > > > > > > > > > > > > with
> > > > >> > > > > > > > > > > > > > > > > local caching). This way, we could
> > > keep
> > > > >> the
> > > > >> > > > > interface
> > > > >> > > > > > > > > simple.
> > > > >> > > > > > > > > > > > Have
> > > > >> > > > > > > > > > > > > we
> > > > >> > > > > > > > > > > > > > > > > considered that?
> > > > >> > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > Thanks,
> > > > >> > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > Jun
> > > > >> > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > On Wed, Sep 28, 2022 at 6:28 AM
> > Divij
> > > > >> Vaidya
> > > > >> > <
> > > > >> > > > > > > > > > > > > > divijvaidya13@gmail.com>
> > > > >> > > > > > > > > > > > > > > > > wrote:
> > > > >> > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > Hey folks
> > > > >> > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > Does anyone else have any thoughts
> > > on
> > > > >> this
> > > > >> > > > > before I
> > > > >> > > > > > > > > propose
> > > > >> > > > > > > > > > > > this
> > > > >> > > > > > > > > > > > > > for
> > > > >> > > > > > > > > > > > > > > a
> > > > >> > > > > > > > > > > > > > > > > > vote?
> > > > >> > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > --
> > > > >> > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > >> > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > On Mon, Sep 5, 2022 at 12:57 PM
> > > Satish
> > > > >> > > Duggana
> > > > >> > > > <
> > > > >> > > > > > > > > > > > > > > > satish.duggana@gmail.com
> > > > >> > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > wrote:
> > > > >> > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > > Thanks for the KIP Divij!
> > > > >> > > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > > This is a nice improvement to
> > > avoid
> > > > >> > > > > recalculation
> > > > >> > > > > > > of
> > > > >> > > > > > > > > size.
> > > > >> > > > > > > > > > > > > > > Customized
> > > > >> > > > > > > > > > > > > > > > > > RLMMs
> > > > >> > > > > > > > > > > > > > > > > > > can implement the best possible
> > > > >> approach
> > > > >> > by
> > > > >> > > > > > caching
> > > > >> > > > > > > > or
> > > > >> > > > > > > > > > > > > > maintaining
> > > > >> > > > > > > > > > > > > > > > the
> > > > >> > > > > > > > > > > > > > > > > > size
> > > > >> > > > > > > > > > > > > > > > > > > in an efficient way. But this is
> > > > not a
> > > > >> > big
> > > > >> > > > > > concern
> > > > >> > > > > > > > for
> > > > >> > > > > > > > > the
> > > > >> > > > > > > > > > > > > > default
> > > > >> > > > > > > > > > > > > > > > > topic
> > > > >> > > > > > > > > > > > > > > > > > > based RLMM as mentioned in the
> > > KIP.
> > > > >> > > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > > ~Satish.
> > > > >> > > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > > On Wed, 13 Jul 2022 at 18:48,
> > > Divij
> > > > >> > Vaidya
> > > > >> > > <
> > > > >> > > > > > > > > > > > > > > divijvaidya13@gmail.com>
> > > > >> > > > > > > > > > > > > > > > > > > wrote:
> > > > >> > > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > > > Thank you for your review
> > Luke.
> > > > >> > > > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > > > > Reg: is that would the new
> > > > >> > > > > > `RemoteLogSizeBytes`
> > > > >> > > > > > > > > metric
> > > > >> > > > > > > > > > > > be a
> > > > >> > > > > > > > > > > > > > > > > > performance
> > > > >> > > > > > > > > > > > > > > > > > > > overhead? Although we move the
> > > > >> > > calculation
> > > > >> > > > > to a
> > > > >> > > > > > > > > separate
> > > > >> > > > > > > > > > > > API,
> > > > >> > > > > > > > > > > > > > we
> > > > >> > > > > > > > > > > > > > > > > still
> > > > >> > > > > > > > > > > > > > > > > > > > can't assume users will
> > > implement
> > > > a
> > > > >> > > > > > light-weight
> > > > >> > > > > > > > > method,
> > > > >> > > > > > > > > > > > > right?
> > > > >> > > > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > > > This metric would be logged
> > > using
> > > > >> the
> > > > >> > > > > > information
> > > > >> > > > > > > > > that is
> > > > >> > > > > > > > > > > > > > already
> > > > >> > > > > > > > > > > > > > > > > being
> > > > >> > > > > > > > > > > > > > > > > > > > calculated for handling remote
> > > > >> > retention
> > > > >> > > > > logic,
> > > > >> > > > > > > > > hence, no
> > > > >> > > > > > > > > > > > > > > > additional
> > > > >> > > > > > > > > > > > > > > > > > work
> > > > >> > > > > > > > > > > > > > > > > > > > is required to calculate this
> > > > >> metric.
> > > > >> > > More
> > > > >> > > > > > > > > specifically,
> > > > >> > > > > > > > > > > > > > whenever
> > > > >> > > > > > > > > > > > > > > > > > > > RemoteLogManager calls
> > > > >> getRemoteLogSize
> > > > >> > > > API,
> > > > >> > > > > > this
> > > > >> > > > > > > > > metric
> > > > >> > > > > > > > > > > > > would
> > > > >> > > > > > > > > > > > > > be
> > > > >> > > > > > > > > > > > > > > > > > > captured.
> > > > >> > > > > > > > > > > > > > > > > > > > This API call is made every
> > time
> > > > >> > > > > > RemoteLogManager
> > > > >> > > > > > > > > wants
> > > > >> > > > > > > > > > > to
> > > > >> > > > > > > > > > > > > > handle
> > > > >> > > > > > > > > > > > > > > > > > expired
> > > > >> > > > > > > > > > > > > > > > > > > > remote log segments (which
> > > should
> > > > be
> > > > >> > > > > periodic).
> > > > >> > > > > > > > Does
> > > > >> > > > > > > > > that
> > > > >> > > > > > > > > > > > > > address
> > > > >> > > > > > > > > > > > > > > > > your
> > > > >> > > > > > > > > > > > > > > > > > > > concern?
> > > > >> > > > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > >> > > > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > > > On Tue, Jul 12, 2022 at 11:01 AM Luke Chen
> > > > >> > > > > > > > > > > > > > > > > > > > <showuon@gmail.com> wrote:
> > > > >> > > > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > > > > Hi Divij,
> > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > > > > Thanks for the KIP!
> > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > > > > I think it makes sense to delegate the responsibility
> > > > >> > > > > > > > > > > > > > > > > > > > > of calculation to the specific
> > > > >> > > > > > > > > > > > > > > > > > > > > RemoteLogMetadataManager implementation.
> > > > >> > > > > > > > > > > > > > > > > > > > > But one thing I'm not quite sure about is whether
> > > > >> > > > > > > > > > > > > > > > > > > > > the new `RemoteLogSizeBytes` metric would be a
> > > > >> > > > > > > > > > > > > > > > > > > > > performance overhead. Although we move the
> > > > >> > > > > > > > > > > > > > > > > > > > > calculation to a separate API, we still can't assume
> > > > >> > > > > > > > > > > > > > > > > > > > > users will implement a light-weight method, right?
> > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > > > > Thank you.
> > > > >> > > > > > > > > > > > > > > > > > > > > Luke
> > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > > > > On Fri, Jul 1, 2022 at 5:47 PM Divij Vaidya
> > > > >> > > > > > > > > > > > > > > > > > > > > <divijvaidya13@gmail.com> wrote:
> > > > >> > > > > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-852%3A+Optimize+calculation+of+size+for+log+in+remote+tier
> > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > > > > > Hey folks
> > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > > > > > Please take a look at this KIP which proposes an
> > > > >> > > > > > > > > > > > > > > > > > > > > > extension to KIP-405. This is my first KIP with
> > > > >> > > > > > > > > > > > > > > > > > > > > > Apache Kafka community so any feedback would be
> > > > >> > > > > > > > > > > > > > > > > > > > > > highly appreciated.
> > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > > > > > Cheers!
> > > > >> > > > > > > > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > > > > > > > --
> > > > >> > > > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > > > >> > > > > > > > > > > > > > > > > > > > > > Sr. Software Engineer
> > > > >> > > > > > > > > > > > > > > > > > > > > > Amazon
> > > > >> > > > > > > > > > > > > > > > > > > > > >

Re: [DISCUSS] KIP-852 Optimize calculation of size for log in remote tier

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Divij,

Sorry for the late reply.

Given your explanation, the new API sounds reasonable to me. Is that enough
to build the external metadata layer for the remote segments or do you need
some additional API changes?

Thanks,

Jun
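
For readers following the thread, the API question above can be illustrated
with a minimal sketch. It is not the KIP's actual interface: SketchRlmm,
ScanningRlmm, IndexedRlmm and remoteLogSize() are hypothetical, simplified
stand-ins, and plain strings replace TopicIdPartition. The sketch only assumes
what the thread describes: a default implementation that aggregates over a full
listing, and a plugin that can answer from a maintained total instead.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RemoteLogSizeSketch {

    /** Hypothetical, simplified stand-in for an RLMM plugin interface. */
    interface SketchRlmm {
        void addSegment(String partition, long sizeInBytes);

        List<Long> listSegmentSizes(String partition);

        /** Default behaviour: full scan over all segment metadata. */
        default long remoteLogSize(String partition) {
            long total = 0L;
            for (long size : listSegmentSizes(partition)) {
                total += size;
            }
            return total;
        }
    }

    /** Plugin that keeps only the segment list and relies on the default scan. */
    static class ScanningRlmm implements SketchRlmm {
        private final Map<String, List<Long>> segments = new HashMap<>();

        @Override
        public void addSegment(String partition, long sizeInBytes) {
            segments.computeIfAbsent(partition, p -> new ArrayList<>()).add(sizeInBytes);
        }

        @Override
        public List<Long> listSegmentSizes(String partition) {
            return segments.getOrDefault(partition, Collections.emptyList());
        }
    }

    /** Plugin that maintains a running total, so the size lookup needs no scan. */
    static class IndexedRlmm extends ScanningRlmm {
        private final Map<String, Long> totals = new HashMap<>();

        @Override
        public void addSegment(String partition, long sizeInBytes) {
            super.addSegment(partition, sizeInBytes);
            totals.merge(partition, sizeInBytes, Long::sum);
        }

        @Override
        public long remoteLogSize(String partition) {
            return totals.getOrDefault(partition, 0L);
        }
    }

    public static void main(String[] args) {
        SketchRlmm scanning = new ScanningRlmm();
        SketchRlmm indexed = new IndexedRlmm();
        for (SketchRlmm rlmm : new SketchRlmm[] {scanning, indexed}) {
            rlmm.addSegment("topic-0", 100L);
            rlmm.addSegment("topic-0", 200L);
        }
        // Both answer the same question; only the cost of answering differs.
        System.out.println(scanning.remoteLogSize("topic-0"));
        System.out.println(indexed.remoteLogSize("topic-0"));
    }
}
```

The caller sees one method in both cases, which is why the thread treats the
choice between scanning and indexing as a plugin-side decision.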

On Fri, Jun 9, 2023 at 7:08 AM Divij Vaidya <di...@gmail.com> wrote:

> Thank you for looking into this Kamal.
>
> You are right in saying that a cold start (i.e. leadership failover or
> broker startup) does not impact the broker startup duration. But it does
> have the following impact:
> 1. It leads to a burst of full-scan requests to RLMM in case multiple
> leadership failovers occur at the same time. Even if the RLMM
> implementation has the capability to serve the total size from an index
> (and hence handle this burst), we wouldn't be able to use it since the
> current API necessarily calls for a full scan.
> 2. The archival (copying of data to tiered storage) process will have a
> delayed start. The delayed start of archival could lead to a local build-up
> of data, which may fill the disk.
>
> The disadvantage of adding this new API is that every provider will have to
> implement it, agreed. But I believe that this tradeoff is worthwhile since
> the default implementation could be the same as you mentioned, i.e. keeping
> a cumulative in-memory count.
>
> --
> Divij Vaidya
>
>
>
> On Sun, Jun 4, 2023 at 5:48 PM Kamal Chandraprakash <
> kamal.chandraprakash@gmail.com> wrote:
>
> > Hi Divij,
> >
> > Thanks for the KIP! Sorry for the late reply.
> >
> > Can you explain the rejected alternative-3?
> > Store the cumulative size of remote tier log in-memory at RemoteLogManager
> > "*Cons*: Every time a broker starts-up, it will scan through all the
> > segments in the remote tier to initialise the in-memory value. This would
> > increase the broker start-up time."
> >
> > Keeping the source of truth to determine the remote-log-size in the leader
> > would be consistent across different implementations of the plugin. The
> > concern posted in the KIP is that we are calculating the remote-log-size on
> > each iteration of the cleaner thread (say 5 mins). If we calculate only
> > once during broker startup or during the leadership reassignment, do we
> > still need the cache?
> >
> > The broker startup time won't be affected by the remote log manager
> > initialisation. The broker continues to accept new produce/fetch
> > requests, while the RLM thread in the background can determine the
> > remote-log-size once and start copying/deleting the segments.
> >
> > Thanks,
> > Kamal
> >
> > On Thu, Jun 1, 2023 at 2:08 PM Divij Vaidya <di...@gmail.com>
> > wrote:
> >
> > > Satish / Jun
> > >
> > > Do you have any thoughts on this?
> > >
> > > --
> > > Divij Vaidya
> > >
> > >
> > >
> > > On Tue, Feb 14, 2023 at 4:15 PM Divij Vaidya <di...@gmail.com>
> > > wrote:
> > >
> > > > Hey Jun
> > > >
> > > > It has been a while since this KIP got some attention. While we wait
> > > > for Satish to chime in here, perhaps I can answer your question.
> > > >
> > > > > Could you explain how you exposed the log size in your KIP-405
> > > > > implementation?
> > > >
> > > > The APIs available in RLMM as per KIP405 are
> > > > addRemoteLogSegmentMetadata(), updateRemoteLogSegmentMetadata(),
> > > > remoteLogSegmentMetadata(), highestOffsetForEpoch(),
> > > > putRemotePartitionDeleteMetadata(), listRemoteLogSegments(),
> > > > onPartitionLeadershipChanges() and onStopPartitions(). None of these
> > > > APIs allow us to expose the log size, hence, the only option that
> > > > remains is to list all segments using listRemoteLogSegments() and
> > > > aggregate them every time we require to calculate the size. Based on
> > > > our prior discussion, this requires reading all segment metadata which
> > > > won't work for non-local RLMM implementations. Satish's implementation
> > > > also performs a full scan and calculates the aggregate. see:
> > > > https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/remote/RemoteLogManager.scala#L619
> > > >
> > > >
> > > > Does this answer your question?
> > > >
> > > > --
> > > > Divij Vaidya
> > > >
> > > >
> > > >
> > > > On Tue, Dec 20, 2022 at 8:40 PM Jun Rao <ju...@confluent.io.invalid>
> > > wrote:
> > > >
> > > >> Hi, Divij,
> > > >>
> > > >> Thanks for the explanation.
> > > >>
> > > >> Good question.
> > > >>
> > > >> Hi, Satish,
> > > >>
> > > >> Could you explain how you exposed the log size in your KIP-405
> > > >> implementation?
> > > >>
> > > >> Thanks,
> > > >>
> > > >> Jun
> > > >>
> > > > >> On Tue, Dec 20, 2022 at 4:59 AM Divij Vaidya
> > > > >> <divijvaidya13@gmail.com> wrote:
> > > >>
> > > >> > Hey Jun
> > > >> >
> > > > >> > Yes, it is possible to maintain the log size in the cache (see
> > > > >> > rejected alternative#3 in the KIP) but I did not understand how
> > > > >> > it is possible to retrieve it without the new API. The log size
> > > > >> > could be calculated on startup by scanning through the segments
> > > > >> > (though I would disagree that this is the right approach since
> > > > >> > scanning itself takes on the order of minutes and hence delays
> > > > >> > the start of the archive process), and incrementally maintained
> > > > >> > afterwards. Even then, we would need an API in
> > > > >> > RemoteLogMetadataManager so that RLM could fetch the cached size!
> > > > >> >
> > > > >> > If we wish to cache the size without adding a new API, then we
> > > > >> > need to cache the size in RLM itself (instead of the RLMM
> > > > >> > implementation) and incrementally manage it. The downside of
> > > > >> > longer archive time at startup (due to the initial scan) still
> > > > >> > remains valid in this situation.
> > > >> >
> > > >> > --
> > > >> > Divij Vaidya
> > > >> >
> > > >> >
> > > >> >
> > > > >> > On Fri, Dec 16, 2022 at 12:43 AM Jun Rao
> > > > >> > <jun@confluent.io.invalid> wrote:
> > > >> >
> > > >> > > Hi, Divij,
> > > >> > >
> > > >> > > Thanks for the explanation.
> > > >> > >
> > > > >> > > If there is in-memory cache, could we maintain the log size in
> > > > >> > > the cache with the existing API? For example, a replica could
> > > > >> > > make a listRemoteLogSegments(TopicIdPartition topicIdPartition)
> > > > >> > > call on startup to get the remote segment size before the
> > > > >> > > current leaderEpoch. The leader could then maintain the size
> > > > >> > > incrementally afterwards. On leader change, other replicas can
> > > > >> > > make a listRemoteLogSegments(TopicIdPartition topicIdPartition,
> > > > >> > > int leaderEpoch) call to get the size of newly generated
> > > > >> > > segments.
> > > >> > >
> > > >> > > Thanks,
> > > >> > >
> > > >> > > Jun
> > > >> > >
> > > >> > >
> > > > >> > > On Wed, Dec 14, 2022 at 3:27 AM Divij Vaidya
> > > > >> > > <divijvaidya13@gmail.com> wrote:
> > > >> > >
> > > >> > > > > Is the new method enough for doing size-based retention?
> > > >> > > >
> > > > >> > > > Yes. You are right in assuming that this API only provides the
> > > > >> > > > Remote storage size (for current epoch chain). We would use
> > > > >> > > > this API for size based retention along with a value of
> > > > >> > > > localOnlyLogSegmentSize which is computed as
> > > > >> > > > Log.sizeInBytes(logSegments.filter(_.baseOffset >
> > > > >> > > > highestOffsetWithRemoteIndex)). Hence, (total_log_size =
> > > > >> > > > remoteLogSizeBytes + log.localOnlyLogSegmentSize). I have
> > > > >> > > > updated the KIP with this information. You can also check an
> > > > >> > > > example implementation at
> > > > >> > > > https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/Log.scala#L2077
> > > >> > > >
> > > >> > > >
> > > > >> > > > > Do you imagine all accesses to remote metadata will be
> > > > >> > > > > across the network or will there be some local in-memory
> > > > >> > > > > cache?
> > > > >> > > >
> > > > >> > > > I would expect a disk-less implementation to maintain a finite
> > > > >> > > > in-memory cache for segment metadata to optimize the number of
> > > > >> > > > network calls made to fetch the data. In future, we can think
> > > > >> > > > about bringing this finite size cache into RLM itself but
> > > > >> > > > that's probably a conversation for a different KIP. There are
> > > > >> > > > many other things we would like to do to optimize the Tiered
> > > > >> > > > storage interface such as introducing a circular buffer /
> > > > >> > > > streaming interface from RSM (so that we don't have to wait to
> > > > >> > > > fetch the entire segment before starting to send records to
> > > > >> > > > the consumer), caching the segments fetched from RSM locally
> > > > >> > > > (I would assume all RSM plugin implementations to do this,
> > > > >> > > > might as well add it to RLM) etc.
> > > >> > > >
> > > >> > > > --
> > > >> > > > Divij Vaidya
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > > >> > > > On Mon, Dec 12, 2022 at 7:35 PM Jun Rao
> > > > >> > > > <jun@confluent.io.invalid> wrote:
> > > >> > > >
> > > >> > > > > Hi, Divij,
> > > >> > > > >
> > > >> > > > > Thanks for the reply.
> > > >> > > > >
> > > >> > > > > Is the new method enough for doing size-based retention? It
> > > gives
> > > >> the
> > > >> > > > total
> > > >> > > > > size of the remote segments, but it seems that we still
> don't
> > > know
> > > >> > the
> > > >> > > > > exact total size for a log since there could be overlapping
> > > >> segments
> > > >> > > > > between the remote and the local segments.
> > > >> > > > >
> > > >> > > > > You mentioned a disk-less implementation. Do you imagine all
> > > >> accesses
> > > >> > > to
> > > >> > > > > remote metadata will be across the network or will there be
> > some
> > > >> > local
> > > >> > > > > in-memory cache?
> > > >> > > > >
> > > >> > > > > Thanks,
> > > >> > > > >
> > > >> > > > > Jun
> > > >> > > > >
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > On Wed, Dec 7, 2022 at 3:10 AM Divij Vaidya <
> > > >> divijvaidya13@gmail.com
> > > >> > >
> > > >> > > > > wrote:
> > > >> > > > >
> > > >> > > > > > The method is needed for RLMM implementations which fetch
> > the
> > > >> > > > information
> > > >> > > > > > over the network and not for the disk based
> implementations
> > > >> (such
> > > >> > as
> > > >> > > > the
> > > >> > > > > > default topic based RLMM).
> > > >> > > > > >
> > > >> > > > > > I would argue that adding this API makes the interface
> more
> > > >> generic
> > > >> > > > than
> > > >> > > > > > what it is today. This is because, with the current APIs
> an
> > > >> > > implementor
> > > >> > > > > is
> > > >> > > > > > restricted to use disk based RLMM solutions only (i.e. the
> > > >> default
> > > >> > > > > > solution) whereas if we add this new API, we unblock usage
> > of
> > > >> > network
> > > >> > > > > based
> > > >> > > > > > RLMM implementations such as databases.
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > On Wed 30. Nov 2022 at 20:40, Jun Rao
> > > <jun@confluent.io.invalid
> > > >> >
> > > >> > > > wrote:
> > > >> > > > > >
> > > >> > > > > > > Hi, Divij,
> > > >> > > > > > >
> > > >> > > > > > > Thanks for the reply.
> > > >> > > > > > >
> > > >> > > > > > > Point#2. My high level question is that is the new
> method
> > > >> needed
> > > >> > > for
> > > >> > > > > > every
> > > >> > > > > > > implementation of remote storage or just for a specific
> > > >> > > > implementation.
> > > >> > > > > > The
> > > >> > > > > > > issues that you pointed out exist for the default
> > > >> implementation
> > > >> > of
> > > >> > > > > RLMM
> > > >> > > > > > as
> > > >> > > > > > > well and so far, the default implementation hasn't
> found a
> > > >> need
> > > >> > > for a
> > > >> > > > > > > similar new method. For public interface, ideally we
> want
> > to
> > > >> make
> > > >> > > it
> > > >> > > > > more
> > > >> > > > > > > general.
> > > >> > > > > > >
> > > >> > > > > > > Thanks,
> > > >> > > > > > >
> > > >> > > > > > > Jun
> > > >> > > > > > >
> > > >> > > > > > > On Mon, Nov 21, 2022 at 7:11 AM Divij Vaidya <
> > > >> > > > divijvaidya13@gmail.com>
> > > >> > > > > > > wrote:
> > > >> > > > > > >
> > > >> > > > > > > > Thank you Jun and Alex for your comments.
> > > >> > > > > > > >
> > > >> > > > > > > > Point#1: You are right Jun. As Alex mentioned, the
> > > "derived
> > > >> > > > metadata"
> > > >> > > > > > can
> > > >> > > > > > > > increase the size of cached metadata by a factor of 10
> > but
> > > >> it
> > > >> > > > should
> > > >> > > > > be
> > > >> > > > > > > ok
> > > >> > > > > > > > to cache just the actual metadata. My point about size
> > > >> being a
> > > >> > > > > > limitation
> > > >> > > > > > > > for using cache is not valid anymore.
> > > >> > > > > > > >
> > > >> > > > > > > > Point#2: For a new replica, it would still have to
> fetch
> > > the
> > > >> > > > metadata
> > > >> > > > > > > over
> > > >> > > > > > > > the network to initiate the warm up of the cache and
> > > hence,
> > > >> > > > increase
> > > >> > > > > > the
> > > >> > > > > > > > start time of the archival process. Please also note
> the
> > > >> > > > > repercussions
> > > >> > > > > > of
> > > >> > > > > > > > the warm up scan that Alex mentioned in this thread as
> > > part
> > > >> of
> > > >> > > > > #102.2.
> > > >> > > > > > > >
> > > >> > > > > > > > 100#: Agreed Alex. Thanks for clarifying that. My
> point
> > > >> about
> > > >> > > size
> > > >> > > > > > being
> > > >> > > > > > > a
> > > >> > > > > > > > limitation for using cache is not valid anymore.
> > > >> > > > > > > >
> > > >> > > > > > > > 101#: Alex, if I understand correctly, you are
> > suggesting
> > > to
> > > >> > > cache
> > > >> > > > > the
> > > >> > > > > > > > total size at the leader and update it on archival.
> This
> > > >> > wouldn't
> > > >> > > > > work
> > > >> > > > > > > for
> > > >> > > > > > > > cases when the leader restarts where we would have to
> > > make a
> > > >> > full
> > > >> > > > > scan
> > > >> > > > > > > > to update the total size entry on startup. We expect
> > users
> > > >> to
> > > >> > > store
> > > >> > > > > > data
> > > >> > > > > > > > over longer duration in remote storage which increases
> > the
> > > >> > > > likelihood
> > > >> > > > > > of
> > > >> > > > > > > > leader restarts / failovers.
> > > >> > > > > > > >
> > > >> > > > > > > > 102#.1: I don't think that the current design
> > accommodates
> > > >> the
> > > >> > > fact
> > > >> > > > > > that
> > > >> > > > > > > > data corruption could happen at the RLMM plugin (we
> > don't
> > > >> have
> > > >> > > > > checksum
> > > >> > > > > > > as
> > > >> > > > > > > > a field in metadata as part of KIP405). If data
> > corruption
> > > >> > > occurs,
> > > >> > > > w/
> > > >> > > > > > or
> > > >> > > > > > > > w/o the cache, it would be a different problem to
> > solve. I
> > > >> > would
> > > >> > > > like
> > > >> > > > > > to
> > > >> > > > > > > > keep this outside the scope of this KIP.
> > > >> > > > > > > >
> > > >> > > > > > > > 102#.2: Agree. This remains as the main concern for
> > using
> > > >> the
> > > >> > > cache
> > > >> > > > > to
> > > >> > > > > > > > fetch total size.
> > > >> > > > > > > >
> > > >> > > > > > > > Regards,
> > > >> > > > > > > > Divij Vaidya
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > > On Fri, Nov 18, 2022 at 12:59 PM Alexandre Dupriez <
> > > >> > > > > > > > alexandre.dupriez@gmail.com> wrote:
> > > >> > > > > > > >
> > > >> > > > > > > > > Hi Divij,
> > > >> > > > > > > > >
> > > >> > > > > > > > > Thanks for the KIP. Please find some comments based
> on
> > > >> what I
> > > >> > > > read
> > > >> > > > > on
> > > >> > > > > > > > > this thread so far - apologies for the repeats and
> the
> > > >> late
> > > >> > > > reply.
> > > >> > > > > > > > >
> > > >> > > > > > > > > If I understand correctly, one of the main elements
> of
> > > >> > > discussion
> > > >> > > > > is
> > > >> > > > > > > > > about caching in Kafka versus delegation of
> providing
> > > the
> > > >> > > remote
> > > >> > > > > size
> > > >> > > > > > > > > of a topic-partition to the plugin.
> > > >> > > > > > > > >
> > > >> > > > > > > > > A few comments:
> > > >> > > > > > > > >
> > > >> > > > > > > > > 100. The size of the “derived metadata” which is
> > managed
> > > >> by
> > > >> > the
> > > >> > > > > > plugin
> > > >> > > > > > > > > to represent an rlmMetadata can indeed be close to 1
> > kB
> > > on
> > > >> > > > average
> > > >> > > > > > > > > depending on its own internal structure, e.g. the
> > > >> redundancy
> > > >> > it
> > > > >> > > > > > > > > enforces (unfortunately resulting in duplication),
> > > >> additional
> > > >> > > > > > > > > information such as checksums and primary and
> > secondary
> > > >> > > indexable
> > > >> > > > > > > > > keys. But indeed, the rlmMetadata is itself a
> lighter
> > > data
> > > >> > > > > structure
> > > >> > > > > > > > > by a factor of 10. And indeed, instead of caching
> the
> > > >> > “derived
> > > >> > > > > > > > > metadata”, only the rlmMetadata could be, which
> should
> > > >> > address
> > > >> > > > the
> > > >> > > > > > > > > concern regarding the memory occupancy of the cache.
> > > >> > > > > > > > >
> > > >> > > > > > > > > 101. I am not sure I fully understand why we would
> > need
> > > to
> > > >> > > cache
> > > >> > > > > the
> > > >> > > > > > > > > list of rlmMetadata to retain the remote size of a
> > > >> > > > topic-partition.
> > > >> > > > > > > > > Since the leader of a topic-partition is, in
> > > >> non-degenerated
> > > >> > > > cases,
> > > >> > > > > > > > > the only actor which can mutate the remote part of
> the
> > > >> > > > > > > > > topic-partition, hence its size, it could in theory
> > only
> > > >> > cache
> > > >> > > > the
> > > >> > > > > > > > > size of the remote log once it has calculated it? In
> > > which
> > > >> > case
> > > >> > > > > there
> > > >> > > > > > > > > would not be any problem regarding the size of the
> > > caching
> > > >> > > > > strategy.
> > > >> > > > > > > > > Did I miss something there?
> > > >> > > > > > > > >
> > > >> > > > > > > > > 102. There may be a few challenges to consider with
> > > >> caching:
> > > >> > > > > > > > >
> > > >> > > > > > > > > 102.1) As mentioned above, the caching strategy
> > assumes
> > > no
> > > >> > > > mutation
> > > >> > > > > > > > > outside the lifetime of a leader. While this is true
> > in
> > > >> the
> > > >> > > > normal
> > > >> > > > > > > > > course of operation, there could be accidental
> > mutation
> > > >> > outside
> > > >> > > > of
> > > >> > > > > > the
> > > >> > > > > > > > > leader and a loss of consistency between the cached
> > > state
> > > >> and
> > > >> > > the
> > > >> > > > > > > > > actual remote representation of the log. E.g.
> > > split-brain
> > > >> > > > > scenarios,
> > > >> > > > > > > > > bugs in the plugins, bugs in external systems with
> > > >> mutating
> > > >> > > > access
> > > >> > > > > on
> > > >> > > > > > > > > the derived metadata. In the worst case, a drift
> > between
> > > >> the
> > > >> > > > cached
> > > >> > > > > > > > > size and the actual size could lead to over-deleting
> > > >> remote
> > > >> > > data
> > > >> > > > > > which
> > > >> > > > > > > > > is a durability risk.
> > > >> > > > > > > > >
> > > >> > > > > > > > > The alternative you propose, by making the plugin
> the
> > > >> source
> > > >> > of
> > > >> > > > > truth
> > > > >> > > > > > > > > w.r.t. the size of the remote log, can make it
> > easier
> > > >> to
> > > >> > > avoid
> > > >> > > > > > > > > inconsistencies between plugin-managed metadata and
> > the
> > > >> > remote
> > > >> > > > log
> > > >> > > > > > > > > from the perspective of Kafka. On the other hand,
> > plugin
> > > >> > > vendors
> > > >> > > > > > would
> > > >> > > > > > > > > have to implement it with the expected efficiency to
> > > have
> > > >> it
> > > >> > > > yield
> > > >> > > > > > > > > benefits.
> > > >> > > > > > > > >
> > > >> > > > > > > > > 102.2) As you mentioned, the caching strategy in
> Kafka
> > > >> would
> > > >> > > > still
> > > >> > > > > > > > > require one iteration over the list of rlmMetadata
> > when
> > > >> the
> > > >> > > > > > leadership
> > > >> > > > > > > > > of a topic-partition is assigned to a broker, while
> > the
> > > >> > plugin
> > > >> > > > can
> > > >> > > > > > > > > offer alternative constant-time approaches. This
> > > >> calculation
> > > >> > > > cannot
> > > >> > > > > > be
> > > >> > > > > > > > > put on the LeaderAndIsr path and would be performed
> in
> > > the
> > > >> > > > > > background.
> > > >> > > > > > > > > In case of bulk leadership migration, listing the
> > > >> rlmMetadata
> > > >> > > > could
> > > >> > > > > > a)
> > > >> > > > > > > > > result in request bursts to any backend system the
> > > plugin
> > > >> may
> > > >> > > use
> > > >> > > > > > > > > [which shouldn’t be a problem for high-throughput
> data
> > > >> stores
> > > >> > > but
> > > >> > > > > > > > > could have cost implications] b) increase
> utilisation
> > > >> > timespan
> > > >> > > of
> > > >> > > > > the
> > > >> > > > > > > > > RLM threads for these calculations potentially
> leading
> > > to
> > > >> > > > transient
> > > >> > > > > > > > > starvation of tasks queued for, typically,
> offloading
> > > >> > > operations
> > > >> > > > c)
> > > >> > > > > > > > > could have a non-marginal CPU footprint on hardware
> > with
> > > >> > strict
> > > >> > > > > > > > > resource constraints. All these elements could have
> an
> > > >> impact
> > > >> > > to
> > > >> > > > > some
> > > >> > > > > > > > > degree depending on the operational environment.
> > > >> > > > > > > > >
> > > >> > > > > > > > > From a design perspective, one question is where we
> > want
> > > >> the
> > > >> > > > source
> > > >> > > > > > of
> > > >> > > > > > > > > truth w.r.t. remote log size to be during the
> lifetime
> > > of
> > > >> a
> > > >> > > > leader.
> > > >> > > > > > > > > The responsibility of maintaining a consistent
> > > >> representation
> > > >> > > of
> > > >> > > > > the
> > > >> > > > > > > > > remote log is shared by Kafka and the plugin. Which
> > > >> system is
> > > >> > > > best
> > > >> > > > > > > > > placed to maintain such a state while providing the
> > > >> highest
> > > >> > > > > > > > > consistency guarantees is something both Kafka and
> > > plugin
> > > >> > > > designers
> > > >> > > > > > > > > could help understand better.
> > > >> > > > > > > > >
> > > >> > > > > > > > > Many thanks,
> > > >> > > > > > > > > Alexandre
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > > Le jeu. 17 nov. 2022 à 19:27, Jun Rao
> > > >> > <jun@confluent.io.invalid
> > > >> > > >
> > > >> > > > a
> > > >> > > > > > > > écrit :
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > Hi, Divij,
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > Thanks for the reply.
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > Point #1. Is the average remote segment metadata
> > > really
> > > >> > 1KB?
> > > >> > > > > What's
> > > >> > > > > > > > > listed
> > > >> > > > > > > > > > in the public interface is probably well below 100
> > > >> bytes.
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > Point #2. I guess you are assuming that each
> broker
> > > only
> > > >> > > caches
> > > >> > > > > the
> > > >> > > > > > > > > remote
> > > >> > > > > > > > > > segment metadata in memory. An alternative
> approach
> > is
> > > >> to
> > > >> > > cache
> > > >> > > > > > them
> > > >> > > > > > > in
> > > >> > > > > > > > > > both memory and local disk. That way, on broker
> > > restart,
> > > >> > you
> > > >> > > > just
> > > >> > > > > > > need
> > > >> > > > > > > > to
> > > >> > > > > > > > > > fetch the new remote segments' metadata using the
> > > >> > > > > > > > > > listRemoteLogSegments(TopicIdPartition
> > > topicIdPartition,
> > > >> > int
> > > >> > > > > > > > leaderEpoch)
> > > >> > > > > > > > > > api. Will that work?
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > Point #3. Thanks for the explanation and it sounds
> > > good.
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > Thanks,
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > Jun
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > On Thu, Nov 17, 2022 at 7:31 AM Divij Vaidya <
> > > >> > > > > > > divijvaidya13@gmail.com>
> > > >> > > > > > > > > > wrote:
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > > Hi Jun
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > > There are three points that I would like to
> > present
> > > >> here:
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > > 1. We would require a large cache size to
> > > efficiently
> > > >> > cache
> > > >> > > > all
> > > >> > > > > > > > segment
> > > >> > > > > > > > > > > metadata.
> > > >> > > > > > > > > > > 2. Linear scan of all metadata at broker startup
> > to
> > > >> > > populate
> > > >> > > > > the
> > > >> > > > > > > > cache
> > > >> > > > > > > > > will
> > > >> > > > > > > > > > > be slow and will impact the archival process.
> > > >> > > > > > > > > > > 3. There is no other use case where a full scan
> of
> > > >> > segment
> > > >> > > > > > metadata
> > > >> > > > > > > > is
> > > >> > > > > > > > > > > required.
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > > Let's start by quantifying 1. Here's my estimate for the size
> > > >> > > > > > > > > > > of the cache.
> > > >> > > > > > > > > > > Average size of segment metadata = 1KB. This could be more if
> > > >> > > > > > > > > > > we have frequent leader failover with a large number of leader
> > > >> > > > > > > > > > > epochs being stored per segment.
> > > >> > > > > > > > > > > Segment size = 100MB. Users will prefer to reduce the segment
> > > >> > > > > > > > > > > size from the default value of 1GB to ensure timely archival
> > > >> > > > > > > > > > > of data since data from the active segment is not archived.
> > > >> > > > > > > > > > > Cache size = num segments * avg. segment metadata size =
> > > >> > > > > > > > > > > (100TB/100MB)*1KB = 1GB.
> > > >> > > > > > > > > > > While 1GB for cache may not sound like a large number for
> > > >> > > > > > > > > > > larger machines, it does eat into the memory as an additional
> > > >> > > > > > > > > > > cache and makes use cases with large data retention and low
> > > >> > > > > > > > > > > throughput expensive (such use cases could otherwise use
> > > >> > > > > > > > > > > smaller machines).
> > > >> > > > > > > > > > >
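As a rough check of the estimate above, here is a minimal sketch of the arithmetic. The helper name `metadata_cache_bytes` and the decimal unit constants are assumptions made for illustration only; the inputs (100TB of data per broker, 100MB segments, 1KB of metadata per segment) are the figures from the estimate itself.

```python
# Decimal units, assumed here to match the round numbers in the estimate.
KB, MB, GB, TB = 10**3, 10**6, 10**9, 10**12


def metadata_cache_bytes(total_data, segment_size, metadata_per_segment):
    """Return (segment count, bytes needed to cache metadata for all of them)."""
    num_segments = total_data // segment_size
    return num_segments, num_segments * metadata_per_segment


# 100TB of remote data, 100MB segments, 1KB of metadata per segment.
segments, cache = metadata_cache_bytes(100 * TB, 100 * MB, 1 * KB)
print(segments, cache / GB)  # prints: 1000000 1.0
```

The same helper with 1GB segments and 100 bytes of metadata per segment gives a much smaller figure, so the result is quite sensitive to the assumed segment size and per-segment metadata size.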
> > > >> > > > > > > > > > > About point#2:
> > > >> > > > > > > > > > > Even if we say that all segment metadata can fit into the
> > > >> > > > > > > > > > > cache, we will need to populate the cache on broker startup.
> > > >> > > > > > > > > > > It would not be in the critical path of broker startup and
> > > >> > > > > > > > > > > hence won't impact the startup time. But it will impact the
> > > >> > > > > > > > > > > time when we could start the archival process since the RLM
> > > >> > > > > > > > > > > thread pool will be blocked on the first call to
> > > >> > > > > > > > > > > listRemoteLogSegments(). To scan metadata for 1MM segments
> > > >> > > > > > > > > > > (computed above) and transfer 1GB data over the network from
> > > >> > > > > > > > > > > an RLMM such as a remote database would be in the order of
> > > >> > > > > > > > > > > minutes (depending on how efficient the scan is with the RLMM
> > > >> > > > > > > > > > > implementation). I would concede that having RLM threads
> > > >> > > > > > > > > > > blocked for a few minutes is perhaps OK, but if we introduce
> > > >> > > > > > > > > > > the new API proposed in the KIP, we would have a
> > > >> > > > > > > > > > > deterministic startup time for RLM. Adding the API comes at a
> > > >> > > > > > > > > > > low cost and I believe the trade off is worth it.
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > > About point#3:
> > > >> > > > > > > > > > > We can use listRemoteLogSegments(TopicIdPartition
> > > >> > > > > > > > > > > topicIdPartition, int leaderEpoch) to calculate the segments
> > > >> > > > > > > > > > > eligible for deletion (based on size retention) where leader
> > > >> > > > > > > > > > > epoch(s) belong to the current leader epoch chain. I
> > > >> > > > > > > > > > > understand that it may lead to segments belonging to other
> > > >> > > > > > > > > > > epoch lineage not getting deleted and would require a
> > > >> > > > > > > > > > > separate mechanism to delete them. The separate mechanism
> > > >> > > > > > > > > > > would anyways be required to delete these "leaked" segments
> > > >> > > > > > > > > > > as there are other cases which could lead to leaks such as
> > > >> > > > > > > > > > > network problems with RSM midway through writing a segment
> > > >> > > > > > > > > > > etc.
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > > Thank you for the replies so far. They have made me re-think
> > > >> > > > > > > > > > > my assumptions and this dialogue has been very constructive
> > > >> > > > > > > > > > > for me.
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > > Regards,
> > > >> > > > > > > > > > > Divij Vaidya
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > > On Thu, Nov 10, 2022 at 10:49 PM Jun Rao
> > > >> > > > > > <jun@confluent.io.invalid
> > > >> > > > > > > >
> > > >> > > > > > > > > wrote:
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > > > Hi, Divij,
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > Thanks for the reply.
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > It's true that the data in Kafka could be kept
> > > >> longer
> > > >> > > with
> > > >> > > > > > > KIP-405.
> > > >> > > > > > > > > How
> > > >> > > > > > > > > > > > much data do you envision to have per broker?
> > For
> > > >> 100TB
> > > >> > > > data
> > > >> > > > > > per
> > > >> > > > > > > > > broker,
> > > >> > > > > > > > > > > > with 1GB segment and segment metadata of 100
> > > bytes,
> > > >> it
> > > >> > > > > requires
> > > >> > > > > > > > > > > > 100TB/1GB*100 = 10MB, which should fit in
> > memory.
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > RemoteLogMetadataManager has two
> > > >> > listRemoteLogSegments()
> > > >> > > > > > methods.
> > > >> > > > > > > > > The one
> > > >> > > > > > > > > > > > you listed
> > listRemoteLogSegments(TopicIdPartition
> > > >> > > > > > > topicIdPartition,
> > > >> > > > > > > > > int
> > > >> > > > > > > > > > > > leaderEpoch) does return data in offset order.
> > > >> However,
> > > >> > > the
> > > >> > > > > > other
> > > >> > > > > > > > > > > > one listRemoteLogSegments(TopicIdPartition
> > > >> > > > topicIdPartition)
> > > >> > > > > > > > doesn't
> > > >> > > > > > > > > > > > specify the return order. I assume that you
> need
> > > the
> > > >> > > latter
> > > >> > > > > to
> > > >> > > > > > > > > calculate
> > > >> > > > > > > > > > > > the segment size?
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > Thanks,
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > Jun
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > On Thu, Nov 10, 2022 at 10:25 AM Divij Vaidya
> <
> > > >> > > > > > > > > divijvaidya13@gmail.com>
> > > >> > > > > > > > > > > > wrote:
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > > *Jun,*
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > *"the default implementation of RLMM does
> > local
> > > >> > > caching,
> > > >> > > > > > > right?"*
> > > >> > > > > > > > > > > > > Yes, Jun. The default implementation of RLMM
> > > does
> > > >> > > indeed
> > > >> > > > > > cache
> > > >> > > > > > > > the
> > > >> > > > > > > > > > > > segment
> > > >> > > > > > > > > > > > > metadata today, hence, it won't work for use
> > > cases
> > > >> > when
> > > >> > > > the
> > > >> > > > > > > > number
> > > >> > > > > > > > > of
> > > >> > > > > > > > > > > > > segments in remote storage is large enough
> to
> > > >> exceed
> > > >> > > the
> > > >> > > > > size
> > > >> > > > > > > of
> > > >> > > > > > > > > cache.
> > > >> > > > > > > > > > > > As
> > > >> > > > > > > > > > > > > part of this KIP, I will implement the new
> > > >> proposed
> > > >> > API
> > > >> > > > in
> > > >> > > > > > the
> > > >> > > > > > > > > default
> > > >> > > > > > > > > > > > > implementation of RLMM but the underlying
> > > >> > > implementation
> > > >> > > > > will
> > > >> > > > > > > > > still be
> > > >> > > > > > > > > > > a
> > > >> > > > > > > > > > > > > scan. I will pick up optimizing that in a
> > > separate
> > > >> > PR.
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > *"we also cache all segment metadata in the
> > > >> brokers
> > > >> > > > without
> > > >> > > > > > > > > KIP-405. Do
> > > >> > > > > > > > > > > > you
> > > >> > > > > > > > > > > > > see a need to change that?"*
> > > >> > > > > > > > > > > > > Please correct me if I am wrong here but we
> > > cache
> > > >> > > > metadata
> > > >> > > > > > for
> > > >> > > > > > > > > segments
> > > >> > > > > > > > > > > > > "residing in local storage". The size of the
> > > >> current
> > > >> > > > cache
> > > >> > > > > > > works
> > > >> > > > > > > > > fine
> > > >> > > > > > > > > > > for
> > > >> > > > > > > > > > > > > the scale of the number of segments that we
> > > >> expect to
> > > >> > > > store
> > > >> > > > > > in
> > > >> > > > > > > > > local
> > > >> > > > > > > > > > > > > storage. After KIP-405, that cache will
> > continue
> > > >> to
> > > >> > > store
> > > >> > > > > > > > metadata
> > > >> > > > > > > > > for
> > > >> > > > > > > > > > > > > segments which are residing in local storage
> > and
> > > >> > hence,
> > > >> > > > we
> > > >> > > > > > > don't
> > > >> > > > > > > > > need
> > > >> > > > > > > > > > > to
> > > >> > > > > > > > > > > > > change that. For segments which have been
> > > >> offloaded
> > > >> > to
> > > >> > > > > remote
> > > >> > > > > > > > > storage,
> > > >> > > > > > > > > > > it
> > > >> > > > > > > > > > > > > would rely on RLMM. Note that the scale of data stored in
> > > >> > > > > > > > > > > > > RLMM is different from local cache because the number of
> > > >> > > > > > > > > > > > > segments is expected to be much larger than what the
> > > >> > > > > > > > > > > > > current implementation stores in local storage.
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > 2,3,4: RemoteLogMetadataManager.listRemoteLogSegments() does
> > > >> > > > > > > > > > > > > specify the order i.e. it returns the segments sorted by
> > > >> > > > > > > > > > > > > first offset in ascending order. I am copying the API docs
> > > >> > > > > > > > > > > > > for KIP-405 here for your reference:
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > *Returns iterator of remote log segment metadata, sorted by
> > > >> > > > > > > > > > > > > {@link RemoteLogSegmentMetadata#startOffset()} in ascending
> > > >> > > > > > > > > > > > > order which contains the given leader epoch. This is used
> > > >> > > > > > > > > > > > > by remote log retention management subsystem to fetch the
> > > >> > > > > > > > > > > > > segment metadata for a given leader epoch.
> > > >> > > > > > > > > > > > > @param topicIdPartition topic partition
> > > >> > > > > > > > > > > > > @param leaderEpoch leader epoch
> > > >> > > > > > > > > > > > > @return Iterator of remote segments, sorted by start offset
> > > >> > > > > > > > > > > > > in ascending order.*
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > *Luke,*
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > 5. Note that we are trying to optimize the efficiency of
> > > >> > > > > > > > > > > > > size based retention for remote storage. KIP-405 does not
> > > >> > > > > > > > > > > > > introduce a new config, similar to
> > > >> > > > > > > > > > > > > log.retention.check.interval.ms, for periodically checking
> > > >> > > > > > > > > > > > > retention that is applicable to remote storage. Hence, the
> > > >> > > > > > > > > > > > > metric will be updated at the time of invoking the log
> > > >> > > > > > > > > > > > > retention check for the remote tier, which is pending
> > > >> > > > > > > > > > > > > implementation today. We can perhaps come back and update
> > > >> > > > > > > > > > > > > the metric description after the implementation of the log
> > > >> > > > > > > > > > > > > retention check in RemoteLogManager.
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > --
> > > >> > > > > > > > > > > > > Divij Vaidya
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > On Thu, Nov 10, 2022 at 6:16 AM Luke Chen <
> > > >> > > > > showuon@gmail.com
> > > >> > > > > > >
> > > >> > > > > > > > > wrote:
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > Hi Divij,
> > > >> > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > One more question about the metric:
> > > >> > > > > > > > > > > > > > I think the metric will be updated
> > > >> > > > > > > > > > > > > > (1) each time we run the log retention check (that is,
> > > >> > > > > > > > > > > > > > log.retention.check.interval.ms)
> > > >> > > > > > > > > > > > > > (2) when the user explicitly calls getRemoteLogSize
> > > >> > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > Is that correct?
> > > >> > > > > > > > > > > > > > Maybe we should add a note in the metric description;
> > > >> > > > > > > > > > > > > > otherwise, a user who gets, let's say, 0 for
> > > >> > > > > > > > > > > > > > RemoteLogSizeBytes will be surprised.
> > > >> > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > Otherwise, LGTM
> > > >> > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > Thank you for the KIP
> > > >> > > > > > > > > > > > > > Luke
> > > >> > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > On Thu, Nov 10, 2022 at 2:55 AM Jun Rao
> > > >> > > > > > > > <jun@confluent.io.invalid
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > > > > wrote:
> > > >> > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > Hi, Divij,
> > > >> > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > Thanks for the explanation.
> > > >> > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > 1. Hmm, the default implementation of
> RLMM
> > > >> does
> > > >> > > local
> > > >> > > > > > > > caching,
> > > >> > > > > > > > > > > right?
> > > >> > > > > > > > > > > > > > > Currently, we also cache all segment
> > > metadata
> > > >> in
> > > >> > > the
> > > >> > > > > > > brokers
> > > >> > > > > > > > > > > without
> > > >> > > > > > > > > > > > > > > KIP-405. Do you see a need to change
> that?
> > > >> > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > 2,3,4: Yes, your explanation makes
> sense.
> > > >> > However,
> > > >> > > > > > > > > > > > > > > currently,
> > > >> > > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > >> > > > > > > > > doesn't
> > > >> > > > > > > > > > > > > > specify
> > > >> > > > > > > > > > > > > > > a particular order of the iterator. Do
> you
> > > >> intend
> > > >> > > to
> > > >> > > > > > change
> > > >> > > > > > > > > that?
> > > >> > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > Thanks,
> > > >> > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > Jun
> > > >> > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 3:31 AM Divij
> > Vaidya
> > > <
> > > >> > > > > > > > > > > divijvaidya13@gmail.com
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > wrote:
> > > >> > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > Hey Jun
> > > >> > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > Thank you for your comments.
> > > >> > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > *1. "RLMM implementor could ensure that
> > > >> > > > > > > > > > > > > > > > listRemoteLogSegments() is fast"*
> > > >> > > > > > > > > > > > > > > > This would be ideal but pragmatically, it is difficult
> > > >> > > > > > > > > > > > > > > > to ensure that listRemoteLogSegments() is fast. This is
> > > >> > > > > > > > > > > > > > > > because the possibility of a large number of segments
> > > >> > > > > > > > > > > > > > > > (much larger than what Kafka currently handles with
> > > >> > > > > > > > > > > > > > > > local storage today) would make it infeasible to adopt
> > > >> > > > > > > > > > > > > > > > strategies such as local caching to improve the
> > > >> > > > > > > > > > > > > > > > performance of listRemoteLogSegments. Apart from
> > > >> > > > > > > > > > > > > > > > caching (which won't work due to size limitations) I
> > > >> > > > > > > > > > > > > > > > can't think of other strategies which may eliminate
> > > >> > > > > > > > > > > > > > > > the need for IO operations proportional to the number
> > > >> > > > > > > > > > > > > > > > of total segments. Please advise if you have something
> > > >> > > > > > > > > > > > > > > > in mind.
> > > >> > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > *2. "If the size exceeds the retention size, we need to
> > > >> > > > > > > > > > > > > > > > determine the subset of segments to delete to bring the
> > > >> > > > > > > > > > > > > > > > size within the retention limit. Do we need to call
> > > >> > > > > > > > > > > > > > > > RemoteLogMetadataManager.listRemoteLogSegments() to
> > > >> > > > > > > > > > > > > > > > determine that?"*
> > > >> > > > > > > > > > > > > > > > Yes, we need to call listRemoteLogSegments() to
> > > >> > > > > > > > > > > > > > > > determine which segments should be deleted. But there
> > > >> > > > > > > > > > > > > > > > is a difference with the use case we are trying to
> > > >> > > > > > > > > > > > > > > > optimize with this KIP. To determine the subset of
> > > >> > > > > > > > > > > > > > > > segments which would be deleted, we only read metadata
> > > >> > > > > > > > > > > > > > > > for segments which would be deleted via
> > > >> > > > > > > > > > > > > > > > listRemoteLogSegments(). But to determine the
> > > >> > > > > > > > > > > > > > > > totalLogSize, which is required every time retention
> > > >> > > > > > > > > > > > > > > > logic based on size executes, we read metadata of
> > > >> > > > > > > > > > > > > > > > *all* the segments in remote storage. Hence, the
> > > >> > > > > > > > > > > > > > > > number of results returned by
> > > >> > > > > > > > > > > > > > > > *RemoteLogMetadataManager.listRemoteLogSegments()* is
> > > >> > > > > > > > > > > > > > > > different when we are calculating totalLogSize vs.
> > > >> > > > > > > > > > > > > > > > when we are determining the subset of segments to
> > > >> > > > > > > > > > > > > > > > delete.
> > > >> > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > *3. "Also, what about time-based retention? To make
> > > >> > > > > > > > > > > > > > > > that efficient, do we need to make some additional
> > > >> > > > > > > > > > > > > > > > interface changes?"*
> > > >> > > > > > > > > > > > > > > > No. Note that time complexity to determine the
> > > >> > > > > > > > > > > > > > > > segments for retention is different for time based vs.
> > > >> > > > > > > > > > > > > > > > size based. For time based, the time complexity is a
> > > >> > > > > > > > > > > > > > > > function of the number of segments which are "eligible
> > > >> > > > > > > > > > > > > > > > for deletion" (since we only read metadata for
> > > >> > > > > > > > > > > > > > > > segments which would be deleted) whereas in size based
> > > >> > > > > > > > > > > > > > > > retention, the time complexity is a function of "all
> > > >> > > > > > > > > > > > > > > > segments" available in remote storage (metadata of all
> > > >> > > > > > > > > > > > > > > > segments needs to be read to calculate the total
> > > >> > > > > > > > > > > > > > > > size). As you may observe, this KIP will bring the
> > > >> > > > > > > > > > > > > > > > time complexity for both time based retention & size
> > > >> > > > > > > > > > > > > > > > based retention to the same function.
> > > >> > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > 4. Also, please note that this new API introduced in
> > > >> > > > > > > > > > > > > > > > this KIP also enables us to provide a metric for total
> > > >> > > > > > > > > > > > > > > > size of data stored in remote storage. Without the
> > > >> > > > > > > > > > > > > > > > API, calculation of this metric will become very
> > > >> > > > > > > > > > > > > > > > expensive with *listRemoteLogSegments().*
> > > >> > > > > > > > > > > > > > > > I understand that your motivation here is to avoid
> > > >> > > > > > > > > > > > > > > > polluting the interface with optimization specific
> > > >> > > > > > > > > > > > > > > > APIs and I will agree with that goal. But I believe
> > > >> > > > > > > > > > > > > > > > that this new API proposed in the KIP brings in
> > > >> > > > > > > > > > > > > > > > significant improvement and there is no other
> > > >> > > > > > > > > > > > > > > > workaround available to achieve the same performance.
> > > >> > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > Regards,
> > > >> > > > > > > > > > > > > > > > Divij Vaidya
> > > >> > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 12:12 AM Jun
> Rao
> > > >> > > > > > > > > <jun@confluent.io.invalid
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > wrote:
> > > >> > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > Hi, Divij,
> > > >> > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > Thanks for the KIP. Sorry for the
> late
> > > >> reply.
> > > >> > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > The motivation of the KIP is to
> > improve
> > > >> the
> > > >> > > > > > efficiency
> > > >> > > > > > > of
> > > >> > > > > > > > > size
> > > >> > > > > > > > > > > > > based
> > > >> > > > > > > > > > > > > > > > > retention. I am not sure the
> proposed
> > > >> changes
> > > >> > > are
> > > >> > > > > > > enough.
> > > >> > > > > > > > > For
> > > >> > > > > > > > > > > > > > example,
> > > >> > > > > > > > > > > > > > > if
> > > >> > > > > > > > > > > > > > > > > the size exceeds the retention size,
> > we
> > > >> need
> > > >> > to
> > > >> > > > > > > determine
> > > >> > > > > > > > > the
> > > >> > > > > > > > > > > > > subset
> > > >> > > > > > > > > > > > > > of
> > > >> > > > > > > > > > > > > > > > > segments to delete to bring the size
> > > >> within
> > > >> > the
> > > >> > > > > > > retention
> > > >> > > > > > > > > > > limit.
> > > >> > > > > > > > > > > > Do
> > > >> > > > > > > > > > > > > > we
> > > >> > > > > > > > > > > > > > > > need
> > > >> > > > > > > > > > > > > > > > > to call
> > > >> > > > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > >> > > > > > > > to
> > > >> > > > > > > > > > > > > determine
> > > >> > > > > > > > > > > > > > > > that?
> > > >> > > > > > > > > > > > > > > > > Also, what about time-based
> retention?
> > > To
> > > >> > make
> > > >> > > > that
> > > >> > > > > > > > > efficient,
> > > >> > > > > > > > > > > do
> > > >> > > > > > > > > > > > > we
> > > >> > > > > > > > > > > > > > > need
> > > >> > > > > > > > > > > > > > > > > to make some additional interface
> > > changes?
> > > >> > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > An alternative approach is for the
> > RLMM
> > > >> > > > implementor
> > > >> > > > > > to
> > > >> > > > > > > > make
> > > >> > > > > > > > > > > sure
> > > >> > > > > > > > > > > > > > > > > that
> > > >> > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > >> > > > > > > is
> > > >> > > > > > > > > fast
> > > >> > > > > > > > > > > > > (e.g.,
> > > >> > > > > > > > > > > > > > > with
> > > >> > > > > > > > > > > > > > > > > local caching). This way, we could
> > keep
> > > >> the
> > > >> > > > > interface
> > > >> > > > > > > > > simple.
> > > >> > > > > > > > > > > > Have
> > > >> > > > > > > > > > > > > we
> > > >> > > > > > > > > > > > > > > > > considered that?
> > > >> > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > Thanks,
> > > >> > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > Jun
> > > >> > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > On Wed, Sep 28, 2022 at 6:28 AM
> Divij
> > > >> Vaidya
> > > >> > <
> > > >> > > > > > > > > > > > > > divijvaidya13@gmail.com>
> > > >> > > > > > > > > > > > > > > > > wrote:
> > > >> > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > Hey folks
> > > >> > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > Does anyone else have any thoughts
> > on
> > > >> this
> > > >> > > > > before I
> > > >> > > > > > > > > propose
> > > >> > > > > > > > > > > > this
> > > >> > > > > > > > > > > > > > for
> > > >> > > > > > > > > > > > > > > a
> > > >> > > > > > > > > > > > > > > > > > vote?
> > > >> > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > --
> > > >> > > > > > > > > > > > > > > > > > Divij Vaidya
> > > >> > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > On Mon, Sep 5, 2022 at 12:57 PM
> > Satish
> > > >> > > Duggana
> > > >> > > > <
> > > >> > > > > > > > > > > > > > > > satish.duggana@gmail.com
> > > >> > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > wrote:
> > > >> > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > > Thanks for the KIP Divij!
> > > >> > > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > > This is a nice improvement to
> > avoid
> > > >> > > > > recalculation
> > > >> > > > > > > of
> > > >> > > > > > > > > size.
> > > >> > > > > > > > > > > > > > > Customized
> > > >> > > > > > > > > > > > > > > > > > RLMMs
> > > >> > > > > > > > > > > > > > > > > > > can implement the best possible
> > > >> approach
> > > >> > by
> > > >> > > > > > caching
> > > >> > > > > > > > or
> > > >> > > > > > > > > > > > > > maintaining
> > > >> > > > > > > > > > > > > > > > the
> > > >> > > > > > > > > > > > > > > > > > size
> > > >> > > > > > > > > > > > > > > > > > > in an efficient way. But this is
> > > not a
> > > >> > big
> > > >> > > > > > concern
> > > >> > > > > > > > for
> > > >> > > > > > > > > the
> > > >> > > > > > > > > > > > > > default
> > > >> > > > > > > > > > > > > > > > > topic
> > > >> > > > > > > > > > > > > > > > > > > based RLMM as mentioned in the
> > KIP.
> > > >> > > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > > ~Satish.
> > > >> > > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > > On Wed, 13 Jul 2022 at 18:48,
> > Divij
> > > >> > Vaidya
> > > >> > > <
> > > >> > > > > > > > > > > > > > > divijvaidya13@gmail.com>
> > > >> > > > > > > > > > > > > > > > > > > wrote:
> > > >> > > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > > > Thank you for your review
> Luke.
> > > >> > > > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > > > > Reg: is that would the new
> > > >> > > > > > `RemoteLogSizeBytes`
> > > >> > > > > > > > > metric
> > > >> > > > > > > > > > > > be a
> > > >> > > > > > > > > > > > > > > > > > performance
> > > >> > > > > > > > > > > > > > > > > > > > overhead? Although we move the
> > > >> > > calculation
> > > >> > > > > to a
> > > >> > > > > > > > > separate
> > > >> > > > > > > > > > > > API,
> > > >> > > > > > > > > > > > > > we
> > > >> > > > > > > > > > > > > > > > > still
> > > >> > > > > > > > > > > > > > > > > > > > can't assume users will
> > implement
> > > a
> > > >> > > > > > light-weight
> > > >> > > > > > > > > method,
> > > >> > > > > > > > > > > > > right?
> > > >> > > > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > > > This metric would be logged
> > using
> > > >> the
> > > >> > > > > > information
> > > >> > > > > > > > > that is
> > > >> > > > > > > > > > > > > > already
> > > >> > > > > > > > > > > > > > > > > being
> > > >> > > > > > > > > > > > > > > > > > > > calculated for handling remote
> > > >> > retention
> > > >> > > > > logic,
> > > >> > > > > > > > > hence, no
> > > >> > > > > > > > > > > > > > > > additional
> > > >> > > > > > > > > > > > > > > > > > work
> > > >> > > > > > > > > > > > > > > > > > > > is required to calculate this
> > > >> metric.
> > > >> > > More
> > > >> > > > > > > > > specifically,
> > > >> > > > > > > > > > > > > > whenever
> > > >> > > > > > > > > > > > > > > > > > > > RemoteLogManager calls
> > > >> getRemoteLogSize
> > > >> > > > API,
> > > >> > > > > > this
> > > >> > > > > > > > > metric
> > > >> > > > > > > > > > > > > would
> > > >> > > > > > > > > > > > > > be
> > > >> > > > > > > > > > > > > > > > > > > captured.
> > > >> > > > > > > > > > > > > > > > > > > > This API call is made every
> time
> > > >> > > > > > RemoteLogManager
> > > >> > > > > > > > > wants
> > > >> > > > > > > > > > > to
> > > >> > > > > > > > > > > > > > handle
> > > >> > > > > > > > > > > > > > > > > > expired
> > > >> > > > > > > > > > > > > > > > > > > > remote log segments (which
> > should
> > > be
> > > >> > > > > periodic).
> > > >> > > > > > > > Does
> > > >> > > > > > > > > that
> > > >> > > > > > > > > > > > > > address
> > > >> > > > > > > > > > > > > > > > > your
> > > >> > > > > > > > > > > > > > > > > > > > concern?
> > > >> > > > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > > >> > > > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > > > On Tue, Jul 12, 2022 at 11:01
> AM
> > > >> Luke
> > > >> > > Chen
> > > >> > > > <
> > > >> > > > > > > > > > > > > showuon@gmail.com>
> > > >> > > > > > > > > > > > > > > > > wrote:
> > > >> > > > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > > > > Hi Divij,
> > > >> > > > > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > > > > Thanks for the KIP!
> > > >> > > > > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > > > > I think it makes sense to
> > > delegate
> > > >> > the
> > > >> > > > > > > > > responsibility
> > > >> > > > > > > > > > > of
> > > >> > > > > > > > > > > > > > > > > calculation
> > > >> > > > > > > > > > > > > > > > > > to
> > > >> > > > > > > > > > > > > > > > > > > > the
> > > >> > > > > > > > > > > > > > > > > > > > > specific
> > > RemoteLogMetadataManager
> > > >> > > > > > > implementation.
> > > >> > > > > > > > > > > > > > > > > > > > > But one thing I'm not quite
> > > sure,
> > > >> is
> > > >> > > that
> > > >> > > > > > would
> > > >> > > > > > > > > the new
> > > >> > > > > > > > > > > > > > > > > > > > > `RemoteLogSizeBytes` metric
> > be a
> > > >> > > > > performance
> > > >> > > > > > > > > overhead?
> > > >> > > > > > > > > > > > > > > > > > > > > Although we move the
> > calculation
> > > >> to a
> > > >> > > > > > separate
> > > >> > > > > > > > > API, we
> > > >> > > > > > > > > > > > > still
> > > >> > > > > > > > > > > > > > > > can't
> > > >> > > > > > > > > > > > > > > > > > > assume
> > > >> > > > > > > > > > > > > > > > > > > > > users will implement a
> > > >> light-weight
> > > >> > > > method,
> > > >> > > > > > > > right?
> > > >> > > > > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > > > > Thank you.
> > > >> > > > > > > > > > > > > > > > > > > > > Luke
> > > >> > > > > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > > > > On Fri, Jul 1, 2022 at 5:47
> PM
> > > >> Divij
> > > >> > > > > Vaidya <
> > > >> > > > > > > > > > > > > > > > > divijvaidya13@gmail.com
> > > >> > > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > > > > wrote:
> > > >> > > > > > > > > > > > > > > > > > > > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-852%3A+Optimize+calculation+of+size+for+log+in+remote+tier
> > > >> > > > > > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > > > > > Hey folks
> > > >> > > > > > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > > > > > Please take a look at this
> > KIP
> > > >> > which
> > > >> > > > > > proposes
> > > >> > > > > > > > an
> > > >> > > > > > > > > > > > > extension
> > > >> > > > > > > > > > > > > > to
> > > >> > > > > > > > > > > > > > > > > > > KIP-405.
> > > >> > > > > > > > > > > > > > > > > > > > > This
> > > >> > > > > > > > > > > > > > > > > > > > > > is my first KIP with
> Apache
> > > >> Kafka
> > > >> > > > > community
> > > >> > > > > > > so
> > > >> > > > > > > > > any
> > > >> > > > > > > > > > > > > feedback
> > > >> > > > > > > > > > > > > > > > would
> > > >> > > > > > > > > > > > > > > > > > be
> > > >> > > > > > > > > > > > > > > > > > > > > highly
> > > >> > > > > > > > > > > > > > > > > > > > > > appreciated.
> > > >> > > > > > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > > > > > Cheers!
> > > >> > > > > > > > > > > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > > > > > > > > > > --
> > > >> > > > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > > >> > > > > > > > > > > > > > > > > > > > > > Sr. Software Engineer
> > > >> > > > > > > > > > > > > > > > > > > > > > Amazon
> > > >> > > > > > > > > > > > > > > > > > > > > >

Re: [DISCUSS] KIP-852 Optimize calculation of size for log in remote tier

Posted by Divij Vaidya <di...@gmail.com>.
Thank you for looking into this Kamal.

You are right in saying that a cold start (i.e. leadership failover or
broker startup) does not impact the broker startup duration. But it does
have the following impact:
1. It leads to a burst of full-scan requests to RLMM in case multiple
leadership failovers occur at the same time. Even if the RLMM
implementation has the capability to serve the total size from an index
(and hence handle this burst), we wouldn't be able to use it since the
current API necessarily calls for a full scan.
2. The archival (copying of data to tiered storage) process will have a
delayed start. A delayed start of archival could lead to a local build-up
of data, which may fill up the disk.

I agree that the disadvantage of adding this new API is that every provider
will have to implement it. But I believe this tradeoff is worthwhile, since
the default implementation could be the same as what you mentioned, i.e.
keeping a cumulative in-memory count.
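
To make the idea of a cumulative in-memory count concrete, here is a minimal
sketch. It is only an illustration under assumptions: SegmentInfo and
CumulativeRemoteLogSize are hypothetical names that appear in neither KIP-405
nor KIP-852, and the list passed to initialise() stands in for whatever a
full listing of remote segments would return on a cold start.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for remote segment metadata; only the size matters here.
final class SegmentInfo {
    final long sizeInBytes;

    SegmentInfo(long sizeInBytes) {
        this.sizeInBytes = sizeInBytes;
    }
}

// Hypothetical tracker (an assumption, not an API from the KIP).
final class CumulativeRemoteLogSize {
    private final AtomicLong totalBytes = new AtomicLong();

    // Cold start (broker startup or leadership change): one full scan seeds the count.
    void initialise(Iterable<SegmentInfo> segments) {
        long sum = 0L;
        for (SegmentInfo segment : segments) {
            sum += segment.sizeInBytes;
        }
        totalBytes.set(sum);
    }

    // Afterwards the count is maintained incrementally, with no further scans.
    void onSegmentCopied(SegmentInfo segment) {
        totalBytes.addAndGet(segment.sizeInBytes);
    }

    void onSegmentDeleted(SegmentInfo segment) {
        totalBytes.addAndGet(-segment.sizeInBytes);
    }

    long remoteLogSizeBytes() {
        return totalBytes.get();
    }
}

class RemoteLogSizeSketch {
    public static void main(String[] args) {
        CumulativeRemoteLogSize size = new CumulativeRemoteLogSize();
        size.initialise(List.of(new SegmentInfo(100L), new SegmentInfo(200L)));
        size.onSegmentCopied(new SegmentInfo(50L));   // archival adds a segment
        size.onSegmentDeleted(new SegmentInfo(100L)); // retention removes one
        System.out.println("remote log size in bytes: " + size.remoteLogSizeBytes());
    }
}
```

The only expensive step in this sketch is initialise(), which is exactly the
full scan discussed in point 1 above. A plugin that can serve the total from
an index could skip that step entirely.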

--
Divij Vaidya



On Sun, Jun 4, 2023 at 5:48 PM Kamal Chandraprakash <
kamal.chandraprakash@gmail.com> wrote:

> Hi Divij,
>
> Thanks for the KIP! Sorry for the late reply.
>
> Can you explain the rejected alternative-3?
> Store the cumulative size of remote tier log in-memory at RemoteLogManager
> "*Cons*: Every time a broker starts-up, it will scan through all the
> segments in the remote tier to initialise the in-memory value. This would
> increase the broker start-up time."
>
> Keeping the source of truth to determine the remote-log-size in the leader
> would be consistent across different implementations of the plugin. The
> concern posted in the KIP is that we are calculating the remote-log-size on
> each iteration of the cleaner thread (say 5 mins). If we calculate only
> once during broker startup or during the leadership reassignment, do we
> still need the cache?
>
> The broker startup-time won't be affected by the remote log manager
> initialisation. The broker continues to accept new
> produce/fetch requests, while the RLM thread in the background can
> determine the remote-log-size once and start copying/deleting the segments.
>
> Thanks,
> Kamal
>
> On Thu, Jun 1, 2023 at 2:08 PM Divij Vaidya <di...@gmail.com>
> wrote:
>
> > Satish / Jun
> >
> > Do you have any thoughts on this?
> >
> > --
> > Divij Vaidya
> >
> >
> >
> > On Tue, Feb 14, 2023 at 4:15 PM Divij Vaidya <di...@gmail.com>
> > wrote:
> >
> > > Hey Jun
> > >
> > > It has been a while since this KIP got some attention. While we wait
> for
> > > Satish to chime in here, perhaps I can answer your question.
> > >
> > > > Could you explain how you exposed the log size in your KIP-405
> > > implementation?
> > >
> > > The APIs available in RLMM as per KIP405
> > > are, addRemoteLogSegmentMetadata(), updateRemoteLogSegmentMetadata(),
> > remoteLogSegmentMetadata(), highestOffsetForEpoch(),
> > putRemotePartitionDeleteMetadata(), listRemoteLogSegments(),
> > onPartitionLeadershipChanges()
> > > and onStopPartitions(). None of these APIs allow us to expose the log
> > size,
> > > hence, the only option that remains is to list all segments using
> > > listRemoteLogSegments() and aggregate them every time we require to
> > > calculate the size. Based on our prior discussion, this requires
> reading
> > > all segment metadata which won't work for non-local RLMM
> implementations.
> > > Satish's implementation also performs a full scan and calculates the
> > > aggregate. see:
> > >
> >
> https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/remote/RemoteLogManager.scala#L619
> > >
> > >
> > > Does this answer your question?
> > >
> > > --
> > > Divij Vaidya
> > >
> > >
> > >
> > > On Tue, Dec 20, 2022 at 8:40 PM Jun Rao <ju...@confluent.io.invalid>
> > wrote:
> > >
> > >> Hi, Divij,
> > >>
> > >> Thanks for the explanation.
> > >>
> > >> Good question.
> > >>
> > >> Hi, Satish,
> > >>
> > >> Could you explain how you exposed the log size in your KIP-405
> > >> implementation?
> > >>
> > >> Thanks,
> > >>
> > >> Jun
> > >>
> > >> On Tue, Dec 20, 2022 at 4:59 AM Divij Vaidya <divijvaidya13@gmail.com
> >
> > >> wrote:
> > >>
> > >> > Hey Jun
> > >> >
> > >> > Yes, it is possible to maintain the log size in the cache (see
> > rejected
> > >> > alternative#3 in the KIP) but I did not understand how it is
> possible
> > to
> > >> > retrieve it without the new API. The log size could be calculated on
> > >> > startup by scanning through the segments (though I would disagree that
> > >> > this is the right approach, since scanning itself takes on the order of
> > >> > minutes and hence delays the start of the archive process), and incrementally
> > maintained
> > >> > afterwards, even then, we would need an API in
> > RemoteLogMetadataManager
> > >> so
> > >> > that RLM could fetch the cached size!
> > >> >
> > >> > If we wish to cache the size without adding a new API, then we need
> to
> > >> > cache the size in RLM itself (instead of RLMM implementation) and
> > >> > incrementally manage it. The downside of longer archive time at
> > startup
> > >> > (due to the initial scan) still remains valid in this situation.
> > >> >
> > >> > --
> > >> > Divij Vaidya
> > >> >
> > >> >
> > >> >
> > >> > On Fri, Dec 16, 2022 at 12:43 AM Jun Rao <ju...@confluent.io.invalid>
> > >> wrote:
> > >> >
> > >> > > Hi, Divij,
> > >> > >
> > >> > > Thanks for the explanation.
> > >> > >
> > >> > > If there is in-memory cache, could we maintain the log size in the
> > >> cache
> > >> > > with the existing API? For example, a replica could make a
> > >> > > listRemoteLogSegments(TopicIdPartition topicIdPartition) call on
> > >> startup
> > >> > to
> > >> > > get the remote segment size before the current leaderEpoch. The
> > leader
> > >> > > could then maintain the size incrementally afterwards. On leader
> > >> change,
> > >> > > other replicas can make a listRemoteLogSegments(TopicIdPartition
> > >> > > topicIdPartition, int leaderEpoch) call to get the size of newly
> > >> > generated
> > >> > > segments.
> > >> > >
> > >> > > Thanks,
> > >> > >
> > >> > > Jun
> > >> > >
> > >> > >
> > >> > > On Wed, Dec 14, 2022 at 3:27 AM Divij Vaidya <
> > divijvaidya13@gmail.com
> > >> >
> > >> > > wrote:
> > >> > >
> > >> > > > > Is the new method enough for doing size-based retention?
> > >> > > >
> > >> > > > Yes. You are right in assuming that this API only provides the
> > >> Remote
> > >> > > > storage size (for current epoch chain). We would use this API
> for
> > >> size
> > >> > > > based retention along with a value of localOnlyLogSegmentSize
> > which
> > >> is
> > >> > > > computed as Log.sizeInBytes(logSegments.filter(_.baseOffset >
> > >> > > > highestOffsetWithRemoteIndex)). Hence, (total_log_size =
> > >> > > > remoteLogSizeBytes + log.localOnlyLogSegmentSize). I have
> updated
> > >> the
> > >> > KIP
> > >> > > > with this information. You can also check an example
> > implementation
> > >> at
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/Log.scala#L2077
> > >> > > >
> > >> > > >
> > >> > > > > Do you imagine all accesses to remote metadata will be across
> > the
> > >> > > network
> > >> > > > or will there be some local in-memory cache?
> > >> > > >
> > >> > > > I would expect a disk-less implementation to maintain a finite
> > >> > in-memory
> > >> > > > cache for segment metadata to optimize the number of network
> calls
> > >> made
> > >> > > to
> > >> > > > fetch the data. In future, we can think about bringing this
> finite
> > >> size
> > >> > > > cache into RLM itself but that's probably a conversation for a
> > >> > different
> > >> > > > KIP. There are many other things we would like to do to optimize
> > the
> > >> > > Tiered
> > >> > > > storage interface such as introducing a circular buffer /
> > streaming
> > >> > > > interface from RSM (so that we don't have to wait to fetch the
> > >> entire
> > >> > > > segment before starting to send records to the consumer),
> caching
> > >> the
> > >> > > > segments fetched from RSM locally (I would assume all RSM plugin
> > >> > > > implementations to do this, might as well add it to RLM) etc.
> > >> > > >
> > >> > > > --
> > >> > > > Divij Vaidya
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > On Mon, Dec 12, 2022 at 7:35 PM Jun Rao
> <jun@confluent.io.invalid
> > >
> > >> > > wrote:
> > >> > > >
> > >> > > > > Hi, Divij,
> > >> > > > >
> > >> > > > > Thanks for the reply.
> > >> > > > >
> > >> > > > > Is the new method enough for doing size-based retention? It
> > gives
> > >> the
> > >> > > > total
> > >> > > > > size of the remote segments, but it seems that we still don't
> > know
> > >> > the
> > >> > > > > exact total size for a log since there could be overlapping
> > >> segments
> > >> > > > > between the remote and the local segments.
> > >> > > > >
> > >> > > > > You mentioned a disk-less implementation. Do you imagine all
> > >> accesses
> > >> > > to
> > >> > > > > remote metadata will be across the network or will there be
> some
> > >> > local
> > >> > > > > in-memory cache?
> > >> > > > >
> > >> > > > > Thanks,
> > >> > > > >
> > >> > > > > Jun
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > On Wed, Dec 7, 2022 at 3:10 AM Divij Vaidya <
> > >> divijvaidya13@gmail.com
> > >> > >
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > > > The method is needed for RLMM implementations which fetch
> the
> > >> > > > information
> > >> > > > > > over the network and not for the disk based implementations
> > >> (such
> > >> > as
> > >> > > > the
> > >> > > > > > default topic based RLMM).
> > >> > > > > >
> > >> > > > > > I would argue that adding this API makes the interface more
> > >> generic
> > >> > > > than
> > >> > > > > > what it is today. This is because, with the current APIs an
> > >> > > implementor
> > >> > > > > is
> > >> > > > > > restricted to use disk based RLMM solutions only (i.e. the
> > >> default
> > >> > > > > > solution) whereas if we add this new API, we unblock usage
> of
> > >> > network
> > >> > > > > based
> > >> > > > > > RLMM implementations such as databases.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On Wed 30. Nov 2022 at 20:40, Jun Rao
> > <jun@confluent.io.invalid
> > >> >
> > >> > > > wrote:
> > >> > > > > >
> > >> > > > > > > Hi, Divij,
> > >> > > > > > >
> > >> > > > > > > Thanks for the reply.
> > >> > > > > > >
> > >> > > > > > > Point#2. My high level question is that is the new method
> > >> needed
> > >> > > for
> > >> > > > > > every
> > >> > > > > > > implementation of remote storage or just for a specific
> > >> > > > implementation.
> > >> > > > > > The
> > >> > > > > > > issues that you pointed out exist for the default
> > >> implementation
> > >> > of
> > >> > > > > RLMM
> > >> > > > > > as
> > >> > > > > > > well and so far, the default implementation hasn't found a
> > >> need
> > >> > > for a
> > >> > > > > > > similar new method. For public interface, ideally we want
> to
> > >> make
> > >> > > it
> > >> > > > > more
> > >> > > > > > > general.
> > >> > > > > > >
> > >> > > > > > > Thanks,
> > >> > > > > > >
> > >> > > > > > > Jun
> > >> > > > > > >
> > >> > > > > > > On Mon, Nov 21, 2022 at 7:11 AM Divij Vaidya <
> > >> > > > divijvaidya13@gmail.com>
> > >> > > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Thank you Jun and Alex for your comments.
> > >> > > > > > > >
> > >> > > > > > > > Point#1: You are right Jun. As Alex mentioned, the
> > "derived
> > >> > > > metadata"
> > >> > > > > > can
> > >> > > > > > > > increase the size of cached metadata by a factor of 10
> but
> > >> it
> > >> > > > should
> > >> > > > > be
> > >> > > > > > > ok
> > >> > > > > > > > to cache just the actual metadata. My point about size
> > >> being a
> > >> > > > > > limitation
> > >> > > > > > > > for using cache is not valid anymore.
> > >> > > > > > > >
> > >> > > > > > > > Point#2: For a new replica, it would still have to fetch
> > the
> > >> > > > metadata
> > >> > > > > > > over
> > >> > > > > > > > the network to initiate the warm up of the cache and
> > hence,
> > >> > > > increase
> > >> > > > > > the
> > >> > > > > > > > start time of the archival process. Please also note the
> > >> > > > > repercussions
> > >> > > > > > of
> > >> > > > > > > > the warm up scan that Alex mentioned in this thread as
> > part
> > >> of
> > >> > > > > #102.2.
> > >> > > > > > > >
> > >> > > > > > > > 100#: Agreed Alex. Thanks for clarifying that. My point
> > >> about
> > >> > > size
> > >> > > > > > being
> > >> > > > > > > a
> > >> > > > > > > > limitation for using cache is not valid anymore.
> > >> > > > > > > >
> > >> > > > > > > > 101#: Alex, if I understand correctly, you are
> suggesting
> > to
> > >> > > cache
> > >> > > > > the
> > >> > > > > > > > total size at the leader and update it on archival. This
> > >> > wouldn't
> > >> > > > > work
> > >> > > > > > > for
> > >> > > > > > > > cases when the leader restarts where we would have to
> > make a
> > >> > full
> > >> > > > > scan
> > >> > > > > > > > to update the total size entry on startup. We expect
> users
> > >> to
> > >> > > store
> > >> > > > > > data
> > >> > > > > > > > over longer duration in remote storage which increases
> the
> > >> > > > likelihood
> > >> > > > > > of
> > >> > > > > > > > leader restarts / failovers.
> > >> > > > > > > >
> > >> > > > > > > > 102#.1: I don't think that the current design
> accommodates
> > >> the
> > >> > > fact
> > >> > > > > > that
> > >> > > > > > > > data corruption could happen at the RLMM plugin (we
> don't
> > >> have
> > >> > > > > checksum
> > >> > > > > > > as
> > >> > > > > > > > a field in metadata as part of KIP405). If data
> corruption
> > >> > > occurs,
> > >> > > > w/
> > >> > > > > > or
> > >> > > > > > > > w/o the cache, it would be a different problem to
> solve. I
> > >> > would
> > >> > > > like
> > >> > > > > > to
> > >> > > > > > > > keep this outside the scope of this KIP.
> > >> > > > > > > >
> > >> > > > > > > > 102#.2: Agree. This remains as the main concern for
> using
> > >> the
> > >> > > cache
> > >> > > > > to
> > >> > > > > > > > fetch total size.
> > >> > > > > > > >
> > >> > > > > > > > Regards,
> > >> > > > > > > > Divij Vaidya
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > On Fri, Nov 18, 2022 at 12:59 PM Alexandre Dupriez <
> > >> > > > > > > > alexandre.dupriez@gmail.com> wrote:
> > >> > > > > > > >
> > >> > > > > > > > > Hi Divij,
> > >> > > > > > > > >
> > >> > > > > > > > > Thanks for the KIP. Please find some comments based on
> > >> what I
> > >> > > > read
> > >> > > > > on
> > >> > > > > > > > > this thread so far - apologies for the repeats and the
> > >> late
> > >> > > > reply.
> > >> > > > > > > > >
> > >> > > > > > > > > If I understand correctly, one of the main elements of
> > >> > > discussion
> > >> > > > > is
> > >> > > > > > > > > about caching in Kafka versus delegation of providing
> > the
> > >> > > remote
> > >> > > > > size
> > >> > > > > > > > > of a topic-partition to the plugin.
> > >> > > > > > > > >
> > >> > > > > > > > > A few comments:
> > >> > > > > > > > >
> > >> > > > > > > > > 100. The size of the “derived metadata” which is
> managed
> > >> by
> > >> > the
> > >> > > > > > plugin
> > >> > > > > > > > > to represent an rlmMetadata can indeed be close to 1
> kB
> > on
> > >> > > > average
> > >> > > > > > > > > depending on its own internal structure, e.g. the
> > >> redundancy
> > >> > it
> > >> > > > > > > > > enforces (unfortunately resulting in duplication),
> > >> additional
> > >> > > > > > > > > information such as checksums and primary and
> secondary
> > >> > > indexable
> > >> > > > > > > > > keys. But indeed, the rlmMetadata is itself a lighter
> > data
> > >> > > > > structure
> > >> > > > > > > > > by a factor of 10. And indeed, instead of caching the
> > >> > “derived
> > >> > > > > > > > > metadata”, only the rlmMetadata could be, which should
> > >> > address
> > >> > > > the
> > >> > > > > > > > > concern regarding the memory occupancy of the cache.
> > >> > > > > > > > >
> > >> > > > > > > > > 101. I am not sure I fully understand why we would
> need
> > to
> > >> > > cache
> > >> > > > > the
> > >> > > > > > > > > list of rlmMetadata to retain the remote size of a
> > >> > > > topic-partition.
> > >> > > > > > > > > Since the leader of a topic-partition is, in
> > >> non-degenerated
> > >> > > > cases,
> > >> > > > > > > > > the only actor which can mutate the remote part of the
> > >> > > > > > > > > topic-partition, hence its size, it could in theory
> only
> > >> > cache
> > >> > > > the
> > >> > > > > > > > > size of the remote log once it has calculated it? In
> > which
> > >> > case
> > >> > > > > there
> > >> > > > > > > > > would not be any problem regarding the size of the
> > caching
> > >> > > > > strategy.
> > >> > > > > > > > > Did I miss something there?
> > >> > > > > > > > >
> > >> > > > > > > > > 102. There may be a few challenges to consider with
> > >> caching:
> > >> > > > > > > > >
> > >> > > > > > > > > 102.1) As mentioned above, the caching strategy
> assumes
> > no
> > >> > > > mutation
> > >> > > > > > > > > outside the lifetime of a leader. While this is true
> in
> > >> the
> > >> > > > normal
> > >> > > > > > > > > course of operation, there could be accidental
> mutation
> > >> > outside
> > >> > > > of
> > >> > > > > > the
> > >> > > > > > > > > leader and a loss of consistency between the cached
> > state
> > >> and
> > >> > > the
> > >> > > > > > > > > actual remote representation of the log. E.g.
> > split-brain
> > >> > > > > scenarios,
> > >> > > > > > > > > bugs in the plugins, bugs in external systems with
> > >> mutating
> > >> > > > access
> > >> > > > > on
> > >> > > > > > > > > the derived metadata. In the worst case, a drift
> between
> > >> the
> > >> > > > cached
> > >> > > > > > > > > size and the actual size could lead to over-deleting
> > >> remote
> > >> > > data
> > >> > > > > > which
> > >> > > > > > > > > is a durability risk.
> > >> > > > > > > > >
> > >> > > > > > > > > The alternative you propose, by making the plugin the
> > >> source
> > >> > of
> > >> > > > > truth
> > >> > > > > > > > > w.r.t. the size of the remote log, can make it
> easier
> > >> to
> > >> > > avoid
> > >> > > > > > > > > inconsistencies between plugin-managed metadata and
> the
> > >> > remote
> > >> > > > log
> > >> > > > > > > > > from the perspective of Kafka. On the other hand,
> plugin
> > >> > > vendors
> > >> > > > > > would
> > >> > > > > > > > > have to implement it with the expected efficiency to
> > have
> > >> it
> > >> > > > yield
> > >> > > > > > > > > benefits.
> > >> > > > > > > > >
> > >> > > > > > > > > 102.2) As you mentioned, the caching strategy in Kafka
> > >> would
> > >> > > > still
> > >> > > > > > > > > require one iteration over the list of rlmMetadata
> when
> > >> the
> > >> > > > > > leadership
> > >> > > > > > > > > of a topic-partition is assigned to a broker, while
> the
> > >> > plugin
> > >> > > > can
> > >> > > > > > > > > offer alternative constant-time approaches. This
> > >> calculation
> > >> > > > cannot
> > >> > > > > > be
> > >> > > > > > > > > put on the LeaderAndIsr path and would be performed in
> > the
> > >> > > > > > background.
> > >> > > > > > > > > In case of bulk leadership migration, listing the
> > >> rlmMetadata
> > >> > > > could
> > >> > > > > > a)
> > >> > > > > > > > > result in request bursts to any backend system the
> > plugin
> > >> may
> > >> > > use
> > >> > > > > > > > > [which shouldn’t be a problem for high-throughput data
> > >> stores
> > >> > > but
> > >> > > > > > > > > could have cost implications] b) increase utilisation
> > >> > timespan
> > >> > > of
> > >> > > > > the
> > >> > > > > > > > > RLM threads for these calculations potentially leading
> > to
> > >> > > > transient
> > >> > > > > > > > > starvation of tasks queued for, typically, offloading
> > >> > > operations
> > >> > > > c)
> > >> > > > > > > > > could have a non-marginal CPU footprint on hardware
> with
> > >> > strict
> > >> > > > > > > > > resource constraints. All these elements could have an
> > >> impact
> > >> > > to
> > >> > > > > some
> > >> > > > > > > > > degree depending on the operational environment.
> > >> > > > > > > > >
> > >> > > > > > > > > From a design perspective, one question is where we
> want
> > >> the
> > >> > > > source
> > >> > > > > > of
> > >> > > > > > > > > truth w.r.t. remote log size to be during the lifetime
> > of
> > >> a
> > >> > > > leader.
> > >> > > > > > > > > The responsibility of maintaining a consistent
> > >> representation
> > >> > > of
> > >> > > > > the
> > >> > > > > > > > > remote log is shared by Kafka and the plugin. Which
> > >> system is
> > >> > > > best
> > >> > > > > > > > > placed to maintain such a state while providing the
> > >> highest
> > >> > > > > > > > > consistency guarantees is something both Kafka and
> > plugin
> > >> > > > designers
> > >> > > > > > > > > could help understand better.
> > >> > > > > > > > >
> > >> > > > > > > > > Many thanks,
> > >> > > > > > > > > Alexandre
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > Le jeu. 17 nov. 2022 à 19:27, Jun Rao
> > >> > <jun@confluent.io.invalid
> > >> > > >
> > >> > > > a
> > >> > > > > > > > écrit :
> > >> > > > > > > > > >
> > >> > > > > > > > > > Hi, Divij,
> > >> > > > > > > > > >
> > >> > > > > > > > > > Thanks for the reply.
> > >> > > > > > > > > >
> > >> > > > > > > > > > Point #1. Is the average remote segment metadata
> > really
> > >> > 1KB?
> > >> > > > > What's
> > >> > > > > > > > > listed
> > >> > > > > > > > > > in the public interface is probably well below 100
> > >> bytes.
> > >> > > > > > > > > >
> > >> > > > > > > > > > Point #2. I guess you are assuming that each broker
> > only
> > >> > > caches
> > >> > > > > the
> > >> > > > > > > > > remote
> > >> > > > > > > > > > segment metadata in memory. An alternative approach
> is
> > >> to
> > >> > > cache
> > >> > > > > > them
> > >> > > > > > > in
> > >> > > > > > > > > > both memory and local disk. That way, on broker
> > restart,
> > >> > you
> > >> > > > just
> > >> > > > > > > need
> > >> > > > > > > > to
> > >> > > > > > > > > > fetch the new remote segments' metadata using the
> > >> > > > > > > > > > listRemoteLogSegments(TopicIdPartition
> > topicIdPartition,
> > >> > int
> > >> > > > > > > > leaderEpoch)
> > >> > > > > > > > > > api. Will that work?
> > >> > > > > > > > > >
> > >> > > > > > > > > > Point #3. Thanks for the explanation and it sounds
> > good.
> > >> > > > > > > > > >
> > >> > > > > > > > > > Thanks,
> > >> > > > > > > > > >
> > >> > > > > > > > > > Jun
> > >> > > > > > > > > >
> > >> > > > > > > > > > On Thu, Nov 17, 2022 at 7:31 AM Divij Vaidya <
> > >> > > > > > > divijvaidya13@gmail.com>
> > >> > > > > > > > > > wrote:
> > >> > > > > > > > > >
> > >> > > > > > > > > > > Hi Jun
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > There are three points that I would like to
> present
> > >> here:
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > 1. We would require a large cache size to
> > efficiently
> > >> > cache
> > >> > > > all
> > >> > > > > > > > segment
> > >> > > > > > > > > > > metadata.
> > >> > > > > > > > > > > 2. Linear scan of all metadata at broker startup
> to
> > >> > > populate
> > >> > > > > the
> > >> > > > > > > > cache
> > >> > > > > > > > > will
> > >> > > > > > > > > > > be slow and will impact the archival process.
> > >> > > > > > > > > > > 3. There is no other use case where a full scan of
> > >> > segment
> > >> > > > > > metadata
> > >> > > > > > > > is
> > >> > > > > > > > > > > required.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > Let's start by quantifying 1. Here's my estimate
> for
> > >> the
> > >> > > size
> > >> > > > > of
> > >> > > > > > > the
> > >> > > > > > > > > cache.
> > >> > > > > > > > > > > Average size of segment metadata = 1KB. This could
> > be
> > >> > more
> > >> > > if
> > >> > > > > we
> > >> > > > > > > have
> > >> > > > > > > > > > > frequent leader failover with a large number of
> > leader
> > >> > > epochs
> > >> > > > > > being
> > >> > > > > > > > > stored
> > >> > > > > > > > > > > per segment.
> > >> > > > > > > > > > > Segment size = 100MB. Users will prefer to reduce
> > the
> > >> > > segment
> > >> > > > > > size
> > >> > > > > > > > > from the
> > >> > > > > > > > > > > default value of 1GB to ensure timely archival of
> > data
> > >> > > since
> > >> > > > > data
> > >> > > > > > > > from
> > >> > > > > > > > > > > active segment is not archived.
> > >> > > > > > > > > > > Cache size = num segments * avg. segment metadata
> > >> size =
> > >> > > > > > > > > (100TB/100MB)*1KB
> > >> > > > > > > > > > > = 1GB.
> > >> > > > > > > > > > > While 1GB for cache may not sound like a large
> > number
> > >> for
> > >> > > > > larger
> > >> > > > > > > > > machines,
> > >> > > > > > > > > > > it does eat into the memory as an additional cache
> > and
> > >> > > makes
> > >> > > > > use
> > >> > > > > > > > cases
> > >> > > > > > > > > with
> > >> > > > > > > > > > > large data retention with low throughput expensive
> > >> (where
> > >> > > > such
> > >> > > > > > use
> > >> > > > > > > > case
> > >> > > > > > > > > > > could otherwise use smaller machines).
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > About point#2:
> > >> > > > > > > > > > > Even if we say that all segment metadata can fit
> > into
> > >> the
> > >> > > > > cache,
> > >> > > > > > we
> > >> > > > > > > > > will
> > >> > > > > > > > > > > need to populate the cache on broker startup. It
> > would
> > >> > not
> > >> > > be
> > >> > > > > in
> > >> > > > > > > the
> > >> > > > > > > > > > > critical path of broker startup and hence won't
> > >> impact
> > >> > the
> > >> > > > > > startup
> > >> > > > > > > > > time.
> > >> > > > > > > > > > > But it will impact the time when we could start
> the
> > >> > > archival
> > >> > > > > > > process
> > >> > > > > > > > > since
> > >> > > > > > > > > > > the RLM thread pool will be blocked on the first
> > call
> > >> to
> > >> > > > > > > > > > > listRemoteLogSegments(). To scan metadata for 1MM
> > >> > segments
> > >> > > > > > > (computed
> > >> > > > > > > > > above)
> > >> > > > > > > > > > > and transfer 1GB data over the network from a RLMM
> > >> such
> > >> > as
> > >> > > a
> > >> > > > > > remote
> > >> > > > > > > > > > > database would be in the order of minutes
> (depending
> > >> on
> > >> > how
> > >> > > > > > > efficient
> > >> > > > > > > > > the
> > >> > > > > > > > > > > scan is with the RLMM implementation). Although, I
> > >> would
> > >> > > > > concede
> > >> > > > > > > that
> > >> > > > > > > > > > > having RLM threads blocked for a few minutes is
> > >> perhaps
> > >> > OK
> > >> > > > but
> > >> > > > > if
> > >> > > > > > > we
> > >> > > > > > > > > > > introduce the new API proposed in the KIP, we
> would
> > >> have
> > >> > a
> > >> > > > > > > > > > > deterministic startup time for RLM. Adding the API
> > >> comes
> > >> > > at a
> > >> > > > > low
> > >> > > > > > > > cost
> > >> > > > > > > > > and
> > >> > > > > > > > > > > I believe the trade off is worth it.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > About point#3:
> > >> > > > > > > > > > > We can use listRemoteLogSegments(TopicIdPartition
> > >> > > > > > topicIdPartition,
> > >> > > > > > > > int
> > >> > > > > > > > > > > leaderEpoch) to calculate the segments eligible
> for
> > >> > > deletion
> > >> > > > > > (based
> > >> > > > > > > > on
> > >> > > > > > > > > size
> > >> > > > > > > > > > > retention) where leader epoch(s) belong to the
> > current
> > >> > > leader
> > >> > > > > > epoch
> > >> > > > > > > > > chain.
> > >> > > > > > > > > > > I understand that it may lead to segments
> belonging
> > to
> > >> > > other
> > >> > > > > > epoch
> > >> > > > > > > > > lineage
> > >> > > > > > > > > > > not getting deleted and would require a separate
> > >> > mechanism
> > >> > > to
> > >> > > > > > > delete
> > >> > > > > > > > > them.
> > >> > > > > > > > > > > The separate mechanism would anyways be required
> to
> > >> > delete
> > >> > > > > these
> > >> > > > > > > > > "leaked"
> > >> > > > > > > > > > > segments as there are other cases which could lead
> > to
> > >> > leaks
> > >> > > > > such
> > >> > > > > > as
> > >> > > > > > > > > network
> > >> > > > > > > > > > > problems with RSM midway through writing a segment,
> > >> etc.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > Thank you for the replies so far. They have made
> me
> > >> > > re-think
> > >> > > > my
> > >> > > > > > > > > assumptions
> > >> > > > > > > > > > > and this dialogue has been very constructive for
> me.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > Regards,
> > >> > > > > > > > > > > Divij Vaidya
> > >> > > > > > > > > > >
> > >> > > > > > > > > > >
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > On Thu, Nov 10, 2022 at 10:49 PM Jun Rao
> > >> > > > > > <jun@confluent.io.invalid
> > >> > > > > > > >
> > >> > > > > > > > > wrote:
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > > Hi, Divij,
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > Thanks for the reply.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > It's true that the data in Kafka could be kept
> > >> longer
> > >> > > with
> > >> > > > > > > KIP-405.
> > >> > > > > > > > > How
> > >> > > > > > > > > > > > much data do you envision to have per broker?
> For
> > >> 100TB
> > >> > > > data
> > >> > > > > > per
> > >> > > > > > > > > broker,
> > >> > > > > > > > > > > > with 1GB segment and segment metadata of 100
> > bytes,
> > >> it
> > >> > > > > requires
> > >> > > > > > > > > > > > 100TB/1GB*100 = 10MB, which should fit in
> memory.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > RemoteLogMetadataManager has two
> > >> > listRemoteLogSegments()
> > >> > > > > > methods.
> > >> > > > > > > > > The one
> > >> > > > > > > > > > > > you listed
> listRemoteLogSegments(TopicIdPartition
> > >> > > > > > > topicIdPartition,
> > >> > > > > > > > > int
> > >> > > > > > > > > > > > leaderEpoch) does return data in offset order.
> > >> However,
> > >> > > the
> > >> > > > > > other
> > >> > > > > > > > > > > > one listRemoteLogSegments(TopicIdPartition
> > >> > > > topicIdPartition)
> > >> > > > > > > > doesn't
> > >> > > > > > > > > > > > specify the return order. I assume that you need
> > the
> > >> > > latter
> > >> > > > > to
> > >> > > > > > > > > calculate
> > >> > > > > > > > > > > > the segment size?
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > Thanks,
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > Jun
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > On Thu, Nov 10, 2022 at 10:25 AM Divij Vaidya <
> > >> > > > > > > > > divijvaidya13@gmail.com>
> > >> > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > > *Jun,*
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > *"the default implementation of RLMM does
> local
> > >> > > caching,
> > >> > > > > > > right?"*
> > >> > > > > > > > > > > > > Yes, Jun. The default implementation of RLMM
> > does
> > >> > > indeed
> > >> > > > > > cache
> > >> > > > > > > > the
> > >> > > > > > > > > > > > segment
> > >> > > > > > > > > > > > > metadata today, hence, it won't work for use
> > cases
> > >> > when
> > >> > > > the
> > >> > > > > > > > number
> > >> > > > > > > > > of
> > >> > > > > > > > > > > > > segments in remote storage is large enough to
> > >> exceed
> > >> > > the
> > >> > > > > size
> > >> > > > > > > of
> > >> > > > > > > > > cache.
> > >> > > > > > > > > > > > As
> > >> > > > > > > > > > > > > part of this KIP, I will implement the new
> > >> proposed
> > >> > API
> > >> > > > in
> > >> > > > > > the
> > >> > > > > > > > > default
> > >> > > > > > > > > > > > > implementation of RLMM but the underlying
> > >> > > implementation
> > >> > > > > will
> > >> > > > > > > > > still be
> > >> > > > > > > > > > > a
> > >> > > > > > > > > > > > > scan. I will pick up optimizing that in a
> > separate
> > >> > PR.
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > *"we also cache all segment metadata in the
> > >> brokers
> > >> > > > without
> > >> > > > > > > > > KIP-405. Do
> > >> > > > > > > > > > > > you
> > >> > > > > > > > > > > > > see a need to change that?"*
> > >> > > > > > > > > > > > > Please correct me if I am wrong here but we
> > cache
> > >> > > > metadata
> > >> > > > > > for
> > >> > > > > > > > > segments
> > >> > > > > > > > > > > > > "residing in local storage". The size of the
> > >> current
> > >> > > > cache
> > >> > > > > > > works
> > >> > > > > > > > > fine
> > >> > > > > > > > > > > for
> > >> > > > > > > > > > > > > the scale of the number of segments that we
> > >> expect to
> > >> > > > store
> > >> > > > > > in
> > >> > > > > > > > > local
> > >> > > > > > > > > > > > > storage. After KIP-405, that cache will
> continue
> > >> to
> > >> > > store
> > >> > > > > > > > metadata
> > >> > > > > > > > > for
> > >> > > > > > > > > > > > > segments which are residing in local storage
> and
> > >> > hence,
> > >> > > > we
> > >> > > > > > > don't
> > >> > > > > > > > > need
> > >> > > > > > > > > > > to
> > >> > > > > > > > > > > > > change that. For segments which have been
> > >> offloaded
> > >> > to
> > >> > > > > remote
> > >> > > > > > > > > storage,
> > >> > > > > > > > > > > it
> > >> > > > > > > > > > > > > would rely on RLMM. Note that the scale of
> data
> > >> > stored
> > >> > > in
> > >> > > > > > RLMM
> > >> > > > > > > is
> > >> > > > > > > > > > > > different
> > >> > > > > > > > > > > > > from local cache because the number of
> segments
> > is
> > >> > > > expected
> > >> > > > > > to
> > >> > > > > > > be
> > >> > > > > > > > > much
> > >> > > > > > > > > > > > > larger than what current implementation stores
> > in
> > >> > local
> > >> > > > > > > storage.
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > 2,3,4:
> > >> > RemoteLogMetadataManager.listRemoteLogSegments()
> > >> > > > > does
> > >> > > > > > > > > specify
> > >> > > > > > > > > > > the
> > >> > > > > > > > > > > > > order i.e. it returns the segments sorted by
> > first
> > >> > > offset
> > >> > > > > in
> > >> > > > > > > > > ascending
> > >> > > > > > > > > > > > > order. I am copying the API docs for KIP-405
> > here
> > >> for
> > >> > > > your
> > >> > > > > > > > > reference
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > *Returns iterator of remote log segment
> > metadata,
> > >> > > sorted
> > >> > > > by
> > >> > > > > > > > {@link
> > >> > > > > > > > > > > > > RemoteLogSegmentMetadata#startOffset()}
> > >> in ascending
> > >> > > order
> > >> > > > > > which
> > >> > > > > > > > > > > contains
> > >> > > > > > > > > > > > > the given leader epoch. This is used by remote
> > log
> > >> > > > > retention
> > >> > > > > > > > > management
> > >> > > > > > > > > > > > > subsystem to fetch the segment metadata for a
> > given
> > >> > > leader
> > >> > > > > > > > > epoch. @param
> > >> > > > > > > > > > > > > topicIdPartition topic partition @param
> > >> leaderEpoch
> > >> > > > > > leader
> > >> > > > > > > > > > > > > epoch @return
> > >> > > > > > > > > > > > > Iterator of remote segments, sorted by start
> > >> offset
> > >> > in
> > >> > > > > > > ascending
> > >> > > > > > > > > > > order. *
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > *Luke,*
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > 5. Note that we are trying to optimize the
> > >> efficiency
> > >> > > of
> > >> > > > > size
> > >> > > > > > > > based
> > >> > > > > > > > > > > > > retention for remote storage. KIP-405 does not
> > >> > > introduce
> > >> > > > a
> > >> > > > > > new
> > >> > > > > > > > > config
> > >> > > > > > > > > > > for
> > >> > > > > > > > > > > > > periodically checking remote similar to
> > >> > > > > > > > > > > log.retention.check.interval.ms
> > >> > > > > > > > > > > > > which is applicable for remote storage. Hence,
> > the
> > >> > > metric
> > >> > > > > > will
> > >> > > > > > > be
> > >> > > > > > > > > > > updated
> > >> > > > > > > > > > > > > at the time of invoking log retention check
> for
> > >> > remote
> > >> > > > tier
> > >> > > > > > > which
> > >> > > > > > > > > is
> > >> > > > > > > > > > > > > pending implementation today. We can perhaps
> > come
> > >> > back
> > >> > > > and
> > >> > > > > > > update
> > >> > > > > > > > > the
> > >> > > > > > > > > > > > > metric description after the implementation of
> > log
> > >> > > > > retention
> > >> > > > > > > > check
> > >> > > > > > > > > in
> > >> > > > > > > > > > > > > RemoteLogManager.
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > Divij Vaidya
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > On Thu, Nov 10, 2022 at 6:16 AM Luke Chen <
> > >> > > > > showuon@gmail.com
> > >> > > > > > >
> > >> > > > > > > > > wrote:
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > Hi Divij,
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > One more question about the metric:
> > >> > > > > > > > > > > > > > I think the metric will be updated when
> > >> > > > > > > > > > > > > > (1) each time we run the log retention check
> > >> (that
> > >> > > is,
> > >> > > > > > > > > > > > > > log.retention.check.interval.ms)
> > >> > > > > > > > > > > > > > (2) When the user explicitly calls
> getRemoteLogSize
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > Is that correct?
> > >> > > > > > > > > > > > > > Maybe we should add a note in metric
> > >> description,
> > >> > > > > > otherwise,
> > >> > > > > > > > when
> > >> > > > > > > > > > > user
> > >> > > > > > > > > > > > > got,
> > >> > > > > > > > > > > > > > let's say 0 of RemoteLogSizeBytes, they will be
> > >> > surprised.
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > Otherwise, LGTM
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > Thank you for the KIP
> > >> > > > > > > > > > > > > > Luke
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > On Thu, Nov 10, 2022 at 2:55 AM Jun Rao
> > >> > > > > > > > <jun@confluent.io.invalid
> > >> > > > > > > > > >
> > >> > > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Hi, Divij,
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Thanks for the explanation.
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > 1. Hmm, the default implementation of RLMM
> > >> does
> > >> > > local
> > >> > > > > > > > caching,
> > >> > > > > > > > > > > right?
> > >> > > > > > > > > > > > > > > Currently, we also cache all segment
> > metadata
> > >> in
> > >> > > the
> > >> > > > > > > brokers
> > >> > > > > > > > > > > without
> > >> > > > > > > > > > > > > > > KIP-405. Do you see a need to change that?
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > 2,3,4: Yes, your explanation makes sense.
> > >> > However,
> > >> > > > > > > > > > > > > > > currently,
> > >> > > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > >> > > > > > > > > doesn't
> > >> > > > > > > > > > > > > > specify
> > >> > > > > > > > > > > > > > > a particular order of the iterator. Do you
> > >> intend
> > >> > > to
> > >> > > > > > change
> > >> > > > > > > > > that?
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Thanks,
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Jun
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 3:31 AM Divij
> Vaidya
> > <
> > >> > > > > > > > > > > divijvaidya13@gmail.com
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > Hey Jun
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > Thank you for your comments.
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > *1. "RLMM implementor could ensure that
> > >> > > > > > > > > listRemoteLogSegments()
> > >> > > > > > > > > > > is
> > >> > > > > > > > > > > > > > fast"*
> > >> > > > > > > > > > > > > > > > This would be ideal but pragmatically,
> it
> > is
> > >> > > > > difficult
> > >> > > > > > to
> > >> > > > > > > > > ensure
> > >> > > > > > > > > > > > that
> > >> > > > > > > > > > > > > > > > listRemoteLogSegments() is fast. This is
> > >> > because
> > >> > > of
> > >> > > > > the
> > >> > > > > > > > > > > possibility
> > >> > > > > > > > > > > > > of
> > >> > > > > > > > > > > > > > a
> > >> > > > > > > > > > > > > > > > large number of segments (much larger
> than
> > >> what
> > >> > > > Kafka
> > >> > > > > > > > > currently
> > >> > > > > > > > > > > > > handles
> > >> > > > > > > > > > > > > > > > with local storage today) would make it
> > >> > > infeasible
> > >> > > > to
> > >> > > > > > > adopt
> > >> > > > > > > > > > > > > strategies
> > >> > > > > > > > > > > > > > > such
> > >> > > > > > > > > > > > > > > > as local caching to improve the
> > performance
> > >> of
> > >> > > > > > > > > > > > listRemoteLogSegments.
> > >> > > > > > > > > > > > > > > Apart
> > >> > > > > > > > > > > > > > > > from caching (which won't work due to
> size
> > >> > > > > > limitations) I
> > >> > > > > > > > > can't
> > >> > > > > > > > > > > > think
> > >> > > > > > > > > > > > > > of
> > >> > > > > > > > > > > > > > > > other strategies which may eliminate the
> > >> need
> > >> > for
> > >> > > > IO
> > >> > > > > > > > > > > > > > > > operations proportional to the number of
> > >> total
> > >> > > > > > segments.
> > >> > > > > > > > > Please
> > >> > > > > > > > > > > > > advise
> > >> > > > > > > > > > > > > > if
> > >> > > > > > > > > > > > > > > > you have something in mind.
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > 2.  "*If the size exceeds the retention
> > >> size,
> > >> > we
> > >> > > > need
> > >> > > > > > to
> > >> > > > > > > > > > > determine
> > >> > > > > > > > > > > > > the
> > >> > > > > > > > > > > > > > > > subset of segments to delete to bring
> the
> > >> size
> > >> > > > within
> > >> > > > > > the
> > >> > > > > > > > > > > retention
> > >> > > > > > > > > > > > > > > limit.
> > >> > > > > > > > > > > > > > > > Do we need to call
> > >> > > > > > > > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > >> > > > > > > > > > > > > to
> > >> > > > > > > > > > > > > > > > determine that?"*
> > >> > > > > > > > > > > > > > > > Yes, we need to call
> > >> listRemoteLogSegments() to
> > >> > > > > > determine
> > >> > > > > > > > > which
> > >> > > > > > > > > > > > > > segments
> > >> > > > > > > > > > > > > > > > should be deleted. But there is a
> > difference
> > >> > with
> > >> > > > the
> > >> > > > > > use
> > >> > > > > > > > > case we
> > >> > > > > > > > > > > > are
> > >> > > > > > > > > > > > > > > > trying to optimize with this KIP. To
> > >> determine
> > >> > > the
> > >> > > > > > subset
> > >> > > > > > > > of
> > >> > > > > > > > > > > > segments
> > >> > > > > > > > > > > > > > > which
> > >> > > > > > > > > > > > > > > > would be deleted, we only read metadata
> > for
> > >> > > > segments
> > >> > > > > > > which
> > >> > > > > > > > > would
> > >> > > > > > > > > > > be
> > >> > > > > > > > > > > > > > > deleted
> > >> > > > > > > > > > > > > > > > via the listRemoteLogSegments(). But to
> > >> > determine
> > >> > > > the
> > >> > > > > > > > > > > totalLogSize,
> > >> > > > > > > > > > > > > > which
> > >> > > > > > > > > > > > > > > > is required every time retention logic
> > >> based on
> > >> > > > size
> > >> > > > > > > > > executes, we
> > >> > > > > > > > > > > > > read
> > >> > > > > > > > > > > > > > > > metadata of *all* the segments in remote
> > >> > storage.
> > >> > > > > > Hence,
> > >> > > > > > > > the
> > >> > > > > > > > > > > number
> > >> > > > > > > > > > > > > of
> > >> > > > > > > > > > > > > > > > results returned by
> > >> > > > > > > > > > > >
> *RemoteLogMetadataManager.listRemoteLogSegments()
> > >> > > > > > > > > > > > > > *is
> > >> > > > > > > > > > > > > > > > different when we are calculating
> > >> totalLogSize
> > >> > > vs.
> > >> > > > > when
> > >> > > > > > > we
> > >> > > > > > > > > are
> > >> > > > > > > > > > > > > > > determining
> > >> > > > > > > > > > > > > > > > the subset of segments to delete.
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > 3.
> > >> > > > > > > > > > > > > > > > *"Also, what about time-based retention?
> > To
> > >> > make
> > >> > > > that
> > >> > > > > > > > > efficient,
> > >> > > > > > > > > > > do
> > >> > > > > > > > > > > > > we
> > >> > > > > > > > > > > > > > > need
> > >> > > > > > > > > > > > > > > > to make some additional interface
> > >> changes?"* No.
> > >> > > > Note
> > >> > > > > > that
> > >> > > > > > > > > time
> > >> > > > > > > > > > > > > > complexity
> > >> > > > > > > > > > > > > > > > to determine the segments for retention
> is
> > >> > > > different
> > >> > > > > > for
> > >> > > > > > > > time
> > >> > > > > > > > > > > based
> > >> > > > > > > > > > > > > vs.
> > >> > > > > > > > > > > > > > > > size based. For time based, the time
> > >> complexity
> > >> > > is
> > >> > > > a
> > >> > > > > > > > > function of
> > >> > > > > > > > > > > > the
> > >> > > > > > > > > > > > > > > number
> > >> > > > > > > > > > > > > > > > of segments which are "eligible for
> > >> deletion"
> > >> > > > (since
> > >> > > > > we
> > >> > > > > > > > only
> > >> > > > > > > > > read
> > >> > > > > > > > > > > > > > > metadata
> > >> > > > > > > > > > > > > > > > for segments which would be deleted)
> > >> whereas in
> > >> > > > size
> > >> > > > > > > based
> > >> > > > > > > > > > > > retention,
> > >> > > > > > > > > > > > > > the
> > >> > > > > > > > > > > > > > > > time complexity is a function of "all
> > >> segments"
> > >> > > > > > available
> > >> > > > > > > > in
> > >> > > > > > > > > > > remote
> > >> > > > > > > > > > > > > > > storage
> > >> > > > > > > > > > > > > > > > (metadata of all segments needs to be
> read
> > >> to
> > >> > > > > calculate
> > >> > > > > > > the
> > >> > > > > > > > > total
> > >> > > > > > > > > > > > > > size).
> > >> > > > > > > > > > > > > > > As
> > >> > > > > > > > > > > > > > > > you may observe, this KIP will bring the
> > >> time
> > >> > > > > > complexity
> > >> > > > > > > > for
> > >> > > > > > > > > both
> > >> > > > > > > > > > > > > time
> > >> > > > > > > > > > > > > > > > based retention & size based retention
> to
> > >> the
> > >> > > same
> > >> > > > > > > > function.
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > 4. Also, please note that this new API
> > >> > introduced
> > >> > > > in
> > >> > > > > > this
> > >> > > > > > > > KIP
> > >> > > > > > > > > > > also
> > >> > > > > > > > > > > > > > > enables
> > >> > > > > > > > > > > > > > > > us to provide a metric for total size of
> > >> data
> > >> > > > stored
> > >> > > > > in
> > >> > > > > > > > > remote
> > >> > > > > > > > > > > > > storage.
> > >> > > > > > > > > > > > > > > > Without the API, calculation of this
> > metric
> > >> > will
> > >> > > > > become
> > >> > > > > > > > very
> > >> > > > > > > > > > > > > expensive
> > >> > > > > > > > > > > > > > > with
> > >> > > > > > > > > > > > > > > > *listRemoteLogSegments().*
> > >> > > > > > > > > > > > > > > > I understand that your motivation here
> is
> > to
> > >> > > avoid
> > >> > > > > > > > polluting
> > >> > > > > > > > > the
> > >> > > > > > > > > > > > > > > interface
> > >> > > > > > > > > > > > > > > > with optimization specific APIs and I
> will
> > >> > agree
> > >> > > > with
> > >> > > > > > > that
> > >> > > > > > > > > goal.
> > >> > > > > > > > > > > > But
> > >> > > > > > > > > > > > > I
> > >> > > > > > > > > > > > > > > > believe that this new API proposed in
> the
> > >> KIP
> > >> > > > brings
> > >> > > > > in
> > >> > > > > > > > > > > significant
> > >> > > > > > > > > > > > > > > > improvement and there is no other work
> > >> around
> > >> > > > > available
> > >> > > > > > > to
> > >> > > > > > > > > > > achieve
> > >> > > > > > > > > > > > > the
> > >> > > > > > > > > > > > > > > same
> > >> > > > > > > > > > > > > > > > performance.
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > Regards,
> > >> > > > > > > > > > > > > > > > Divij Vaidya
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 12:12 AM Jun Rao
> > >> > > > > > > > > <jun@confluent.io.invalid
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > Hi, Divij,
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > Thanks for the KIP. Sorry for the late
> > >> reply.
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > The motivation of the KIP is to
> improve
> > >> the
> > >> > > > > > efficiency
> > >> > > > > > > of
> > >> > > > > > > > > size
> > >> > > > > > > > > > > > > based
> > >> > > > > > > > > > > > > > > > > retention. I am not sure the proposed
> > >> changes
> > >> > > are
> > >> > > > > > > enough.
> > >> > > > > > > > > For
> > >> > > > > > > > > > > > > > example,
> > >> > > > > > > > > > > > > > > if
> > >> > > > > > > > > > > > > > > > > the size exceeds the retention size,
> we
> > >> need
> > >> > to
> > >> > > > > > > determine
> > >> > > > > > > > > the
> > >> > > > > > > > > > > > > subset
> > >> > > > > > > > > > > > > > of
> > >> > > > > > > > > > > > > > > > > segments to delete to bring the size
> > >> within
> > >> > the
> > >> > > > > > > retention
> > >> > > > > > > > > > > limit.
> > >> > > > > > > > > > > > Do
> > >> > > > > > > > > > > > > > we
> > >> > > > > > > > > > > > > > > > need
> > >> > > > > > > > > > > > > > > > > to call
> > >> > > > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > >> > > > > > > > to
> > >> > > > > > > > > > > > > determine
> > >> > > > > > > > > > > > > > > > that?
> > >> > > > > > > > > > > > > > > > > Also, what about time-based retention?
> > To
> > >> > make
> > >> > > > that
> > >> > > > > > > > > efficient,
> > >> > > > > > > > > > > do
> > >> > > > > > > > > > > > > we
> > >> > > > > > > > > > > > > > > need
> > >> > > > > > > > > > > > > > > > > to make some additional interface
> > changes?
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > An alternative approach is for the
> RLMM
> > >> > > > implementor
> > >> > > > > > to
> > >> > > > > > > > make
> > >> > > > > > > > > > > sure
> > >> > > > > > > > > > > > > > > > > that
> > >> > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > >> > > > > > > is
> > >> > > > > > > > > fast
> > >> > > > > > > > > > > > > (e.g.,
> > >> > > > > > > > > > > > > > > with
> > >> > > > > > > > > > > > > > > > > local caching). This way, we could
> keep
> > >> the
> > >> > > > > interface
> > >> > > > > > > > > simple.
> > >> > > > > > > > > > > > Have
> > >> > > > > > > > > > > > > we
> > >> > > > > > > > > > > > > > > > > considered that?
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > Thanks,
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > Jun
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > On Wed, Sep 28, 2022 at 6:28 AM Divij
> > >> Vaidya
> > >> > <
> > >> > > > > > > > > > > > > > divijvaidya13@gmail.com>
> > >> > > > > > > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > Hey folks
> > >> > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > Does anyone else have any thoughts
> on
> > >> this
> > >> > > > > before I
> > >> > > > > > > > > propose
> > >> > > > > > > > > > > > this
> > >> > > > > > > > > > > > > > for
> > >> > > > > > > > > > > > > > > a
> > >> > > > > > > > > > > > > > > > > > vote?
> > >> > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > > > > > > Divij Vaidya
> > >> > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > On Mon, Sep 5, 2022 at 12:57 PM
> Satish
> > >> > > Duggana
> > >> > > > <
> > >> > > > > > > > > > > > > > > > satish.duggana@gmail.com
> > >> > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > Thanks for the KIP Divij!
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > This is a nice improvement to
> avoid
> > >> > > > > recalculation
> > >> > > > > > > of
> > >> > > > > > > > > size.
> > >> > > > > > > > > > > > > > > Customized
> > >> > > > > > > > > > > > > > > > > > RLMMs
> > >> > > > > > > > > > > > > > > > > > > can implement the best possible
> > >> approach
> > >> > by
> > >> > > > > > caching
> > >> > > > > > > > or
> > >> > > > > > > > > > > > > > maintaining
> > >> > > > > > > > > > > > > > > > the
> > >> > > > > > > > > > > > > > > > > > size
> > >> > > > > > > > > > > > > > > > > > > in an efficient way. But this is
> > not a
> > >> > big
> > >> > > > > > concern
> > >> > > > > > > > for
> > >> > > > > > > > > the
> > >> > > > > > > > > > > > > > default
> > >> > > > > > > > > > > > > > > > > topic
> > >> > > > > > > > > > > > > > > > > > > based RLMM as mentioned in the
> KIP.
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > ~Satish.
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > On Wed, 13 Jul 2022 at 18:48,
> Divij
> > >> > Vaidya
> > >> > > <
> > >> > > > > > > > > > > > > > > divijvaidya13@gmail.com>
> > >> > > > > > > > > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > Thank you for your review Luke.
> > >> > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > Reg: is that would the new
> > >> > > > > > `RemoteLogSizeBytes`
> > >> > > > > > > > > metric
> > >> > > > > > > > > > > > be a
> > >> > > > > > > > > > > > > > > > > > performance
> > >> > > > > > > > > > > > > > > > > > > > overhead? Although we move the
> > >> > > calculation
> > >> > > > > to a
> > >> > > > > > > > > separate
> > >> > > > > > > > > > > > API,
> > >> > > > > > > > > > > > > > we
> > >> > > > > > > > > > > > > > > > > still
> > >> > > > > > > > > > > > > > > > > > > > can't assume users will
> implement
> > a
> > >> > > > > > light-weight
> > >> > > > > > > > > method,
> > >> > > > > > > > > > > > > right?
> > >> > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > This metric would be logged
> using
> > >> the
> > >> > > > > > information
> > >> > > > > > > > > that is
> > >> > > > > > > > > > > > > > already
> > >> > > > > > > > > > > > > > > > > being
> > >> > > > > > > > > > > > > > > > > > > > calculated for handling remote
> > >> > retention
> > >> > > > > logic,
> > >> > > > > > > > > hence, no
> > >> > > > > > > > > > > > > > > > additional
> > >> > > > > > > > > > > > > > > > > > work
> > >> > > > > > > > > > > > > > > > > > > > is required to calculate this
> > >> metric.
> > >> > > More
> > >> > > > > > > > > specifically,
> > >> > > > > > > > > > > > > > whenever
> > >> > > > > > > > > > > > > > > > > > > > RemoteLogManager calls
> > >> getRemoteLogSize
> > >> > > > API,
> > >> > > > > > this
> > >> > > > > > > > > metric
> > >> > > > > > > > > > > > > would
> > >> > > > > > > > > > > > > > be
> > >> > > > > > > > > > > > > > > > > > > captured.
> > >> > > > > > > > > > > > > > > > > > > > This API call is made every time
> > >> > > > > > RemoteLogManager
> > >> > > > > > > > > wants
> > >> > > > > > > > > > > to
> > >> > > > > > > > > > > > > > handle
> > >> > > > > > > > > > > > > > > > > > expired
> > >> > > > > > > > > > > > > > > > > > > > remote log segments (which
> should
> > be
> > >> > > > > periodic).
> > >> > > > > > > > Does
> > >> > > > > > > > > that
> > >> > > > > > > > > > > > > > address
> > >> > > > > > > > > > > > > > > > > your
> > >> > > > > > > > > > > > > > > > > > > > concern?
> > >> > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > >> > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > On Tue, Jul 12, 2022 at 11:01 AM
> > >> Luke
> > >> > > Chen
> > >> > > > <
> > >> > > > > > > > > > > > > showuon@gmail.com>
> > >> > > > > > > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > Hi Divij,
> > >> > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > Thanks for the KIP!
> > >> > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > I think it makes sense to
> > delegate
> > >> > the
> > >> > > > > > > > > responsibility
> > >> > > > > > > > > > > of
> > >> > > > > > > > > > > > > > > > > calculation
> > >> > > > > > > > > > > > > > > > > > to
> > >> > > > > > > > > > > > > > > > > > > > the
> > >> > > > > > > > > > > > > > > > > > > > > specific
> > RemoteLogMetadataManager
> > >> > > > > > > implementation.
> > >> > > > > > > > > > > > > > > > > > > > > But one thing I'm not quite
> > sure,
> > >> is
> > >> > > that
> > >> > > > > > would
> > >> > > > > > > > > the new
> > >> > > > > > > > > > > > > > > > > > > > > `RemoteLogSizeBytes` metric
> be a
> > >> > > > > performance
> > >> > > > > > > > > overhead?
> > >> > > > > > > > > > > > > > > > > > > > > Although we move the
> calculation
> > >> to a
> > >> > > > > > separate
> > >> > > > > > > > > API, we
> > >> > > > > > > > > > > > > still
> > >> > > > > > > > > > > > > > > > can't
> > >> > > > > > > > > > > > > > > > > > > assume
> > >> > > > > > > > > > > > > > > > > > > > > users will implement a
> > >> light-weight
> > >> > > > method,
> > >> > > > > > > > right?
> > >> > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > Thank you.
> > >> > > > > > > > > > > > > > > > > > > > > Luke
> > >> > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > On Fri, Jul 1, 2022 at 5:47 PM
> > >> Divij
> > >> > > > > Vaidya <
> > >> > > > > > > > > > > > > > > > > divijvaidya13@gmail.com
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-852%3A+Optimize+calculation+of+size+for+log+in+remote+tier
> > >> > > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > > Hey folks
> > >> > > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > > Please take a look at this
> KIP
> > >> > which
> > >> > > > > > proposes
> > >> > > > > > > > an
> > >> > > > > > > > > > > > > extension
> > >> > > > > > > > > > > > > > to
> > >> > > > > > > > > > > > > > > > > > > KIP-405.
> > >> > > > > > > > > > > > > > > > > > > > > This
> > >> > > > > > > > > > > > > > > > > > > > > > is my first KIP with Apache
> > >> Kafka
> > >> > > > > community
> > >> > > > > > > so
> > >> > > > > > > > > any
> > >> > > > > > > > > > > > > feedback
> > >> > > > > > > > > > > > > > > > would
> > >> > > > > > > > > > > > > > > > > > be
> > >> > > > > > > > > > > > > > > > > > > > > highly
> > >> > > > > > > > > > > > > > > > > > > > > > appreciated.
> > >> > > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > > Cheers!
> > >> > > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > > > > > > > > > > Divij Vaidya
> > >> > > > > > > > > > > > > > > > > > > > > > Sr. Software Engineer
> > >> > > > > > > > > > > > > > > > > > > > > > Amazon
> > >> > > > > > > > > > > > > > > > > > > > > >

Re: [DISCUSS] KIP-852 Optimize calculation of size for log in remote tier

Posted by Kamal Chandraprakash <ka...@gmail.com>.
Hi Divij,

Thanks for the KIP! Sorry for the late reply.

Can you explain the rejected alternative-3?
Store the cumulative size of remote tier log in-memory at RemoteLogManager
"*Cons*: Every time a broker starts-up, it will scan through all the
segments in the remote tier to initialise the in-memory value. This would
increase the broker start-up time."

Keeping the source of truth to determine the remote-log-size in the leader
would be consistent across different implementations of the plugin. The
concern posted in the KIP is that we are calculating the remote-log-size on
each iteration of the cleaner thread (say 5 mins). If we calculate only
once during broker startup or during the leadership reassignment, do we
still need the cache?
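
To make the question concrete, here is a minimal sketch of what I have in
mind. It is not taken from the KIP or from the RLM code: RemoteSizeTracker
and its methods are hypothetical names, and the list of segment sizes stands
in for the result of one listRemoteLogSegments() scan.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, not part of KIP-405 or KIP-852.
final class RemoteSizeTracker {
    private final AtomicLong sizeBytes = new AtomicLong();

    // Called once on broker startup or on a leadership change. The argument
    // stands in for segment sizes gathered from a single full scan.
    void seed(List<Long> segmentSizesInBytes) {
        long total = 0L;
        for (long size : segmentSizesInBytes) {
            total += size;
        }
        sizeBytes.set(total);
    }

    // Called by the leader after it copies a segment to the remote tier.
    void onSegmentCopied(long segmentSizeInBytes) {
        sizeBytes.addAndGet(segmentSizeInBytes);
    }

    // Called by the leader after it deletes a remote segment.
    void onSegmentDeleted(long segmentSizeInBytes) {
        sizeBytes.addAndGet(-segmentSizeInBytes);
    }

    long remoteLogSizeBytes() {
        return sizeBytes.get();
    }
}
```

With this shape the cleaner thread reads remoteLogSizeBytes() on each
iteration instead of re-listing the segments.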

The broker startup time won't be affected by the remote log manager
initialisation. The broker can continue to accept new produce/fetch
requests while the RLM thread in the background determines the
remote-log-size once and then starts copying/deleting the segments.
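
A rough sketch of the background initialisation, under my own assumptions:
BackgroundSizeInit is a hypothetical class, the Supplier stands in for the
one-off scan of remote segment metadata, and the executor stands in for the
RLM thread pool. None of these names come from the KIP.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Supplier;

// Hypothetical sketch, not part of the RemoteLogManager code.
final class BackgroundSizeInit {
    // Submits the one-off scan to the given pool and returns immediately,
    // so the caller (e.g. broker startup) is not blocked on the scan.
    static CompletableFuture<Long> computeAsync(Supplier<List<Long>> scan,
                                                ExecutorService rlmPool) {
        return CompletableFuture.supplyAsync(
            () -> scan.get().stream().mapToLong(Long::longValue).sum(),
            rlmPool);
    }
}
```

The copy/delete tasks could then be chained on the returned future (for
example with thenAccept), so they only start once the size is known.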

Thanks,
Kamal

On Thu, Jun 1, 2023 at 2:08 PM Divij Vaidya <di...@gmail.com> wrote:

> Satish / Jun
>
> Do you have any thoughts on this?
>
> --
> Divij Vaidya
>
>
>
> On Tue, Feb 14, 2023 at 4:15 PM Divij Vaidya <di...@gmail.com>
> wrote:
>
> > Hey Jun
> >
> > It has been a while since this KIP got some attention. While we wait for
> > Satish to chime in here, perhaps I can answer your question.
> >
> > > Could you explain how you exposed the log size in your KIP-405
> > implementation?
> >
> > The APIs available in RLMM as per KIP405
> > are, addRemoteLogSegmentMetadata(), updateRemoteLogSegmentMetadata(),
> remoteLogSegmentMetadata(), highestOffsetForEpoch(),
> putRemotePartitionDeleteMetadata(), listRemoteLogSegments(),
> onPartitionLeadershipChanges()
> > and onStopPartitions(). None of these APIs allow us to expose the log
> size,
> > hence, the only option that remains is to list all segments using
> > listRemoteLogSegments() and aggregate them every time we require to
> > calculate the size. Based on our prior discussion, this requires reading
> > all segment metadata which won't work for non-local RLMM implementations.
> > Satish's implementation also performs a full scan and calculates the
> > aggregate. see:
> >
> https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/remote/RemoteLogManager.scala#L619
> >
> >
> > Does this answer your question?
> >
> > --
> > Divij Vaidya
> >
> >
> >
> > On Tue, Dec 20, 2022 at 8:40 PM Jun Rao <ju...@confluent.io.invalid>
> wrote:
> >
> >> Hi, Divij,
> >>
> >> Thanks for the explanation.
> >>
> >> Good question.
> >>
> >> Hi, Satish,
> >>
> >> Could you explain how you exposed the log size in your KIP-405
> >> implementation?
> >>
> >> Thanks,
> >>
> >> Jun
> >>
> >> On Tue, Dec 20, 2022 at 4:59 AM Divij Vaidya <di...@gmail.com>
> >> wrote:
> >>
> >> > Hey Jun
> >> >
> >> > Yes, it is possible to maintain the log size in the cache (see
> rejected
> >> > alternative#3 in the KIP) but I did not understand how it is possible
> to
> >> > retrieve it without the new API. The log size could be calculated on
> >> > startup by scanning through the segments (though I would disagree that
> >> this
> >> > is the right approach since scanning itself takes on the order of minutes and
> >> > hence delays the start of the archive process), and incrementally
> maintained
> >> > afterwards, even then, we would need an API in
> RemoteLogMetadataManager
> >> so
> >> > that RLM could fetch the cached size!
> >> >
> >> > If we wish to cache the size without adding a new API, then we need to
> >> > cache the size in RLM itself (instead of RLMM implementation) and
> >> > incrementally manage it. The downside of longer archive time at
> startup
> >> > (due to initial scan) still remains valid in this situation.
> >> >
> >> > --
> >> > Divij Vaidya
> >> >
> >> >
> >> >
> >> > On Fri, Dec 16, 2022 at 12:43 AM Jun Rao <ju...@confluent.io.invalid>
> >> wrote:
> >> >
> >> > > Hi, Divij,
> >> > >
> >> > > Thanks for the explanation.
> >> > >
> >> > > If there is in-memory cache, could we maintain the log size in the
> >> cache
> >> > > with the existing API? For example, a replica could make a
> >> > > listRemoteLogSegments(TopicIdPartition topicIdPartition) call on
> >> startup
> >> > to
> >> > > get the remote segment size before the current leaderEpoch. The
> leader
> >> > > could then maintain the size incrementally afterwards. On leader
> >> change,
> >> > > other replicas can make a listRemoteLogSegments(TopicIdPartition
> >> > > topicIdPartition, int leaderEpoch) call to get the size of newly
> >> > generated
> >> > > segments.
> >> > >
> >> > > Thanks,
> >> > >
> >> > > Jun
> >> > >
> >> > >
> >> > > On Wed, Dec 14, 2022 at 3:27 AM Divij Vaidya <
> divijvaidya13@gmail.com
> >> >
> >> > > wrote:
> >> > >
> >> > > > > Is the new method enough for doing size-based retention?
> >> > > >
> >> > > > Yes. You are right in assuming that this API only provides the
> >> Remote
> >> > > > storage size (for current epoch chain). We would use this API for
> >> size
> >> > > > based retention along with a value of localOnlyLogSegmentSize
> which
> >> is
> >> > > > computed as Log.sizeInBytes(logSegments.filter(_.baseOffset >
> >> > > > highestOffsetWithRemoteIndex)). Hence, (total_log_size =
> >> > > > remoteLogSizeBytes + log.localOnlyLogSegmentSize). I have updated
> >> the
> >> > KIP
> >> > > > with this information. You can also check an example
> implementation
> >> at
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/Log.scala#L2077
> >> > > >
> >> > > >
> >> > > > > Do you imagine all accesses to remote metadata will be across
> the
> >> > > network
> >> > > > or will there be some local in-memory cache?
> >> > > >
> >> > > > I would expect a disk-less implementation to maintain a finite
> >> > in-memory
> >> > > > cache for segment metadata to optimize the number of network calls
> >> made
> >> > > to
> >> > > > fetch the data. In future, we can think about bringing this finite
> >> size
> >> > > > cache into RLM itself but that's probably a conversation for a
> >> > different
> >> > > > KIP. There are many other things we would like to do to optimize
> the
> >> > > Tiered
> >> > > > storage interface such as introducing a circular buffer /
> streaming
> >> > > > interface from RSM (so that we don't have to wait to fetch the
> >> entire
> >> > > > segment before starting to send records to the consumer), caching
> >> the
> >> > > > segments fetched from RSM locally (I would assume all RSM plugin
> >> > > > implementations to do this, might as well add it to RLM) etc.
> >> > > >
> >> > > > --
> >> > > > Divij Vaidya
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Mon, Dec 12, 2022 at 7:35 PM Jun Rao <jun@confluent.io.invalid
> >
> >> > > wrote:
> >> > > >
> >> > > > > Hi, Divij,
> >> > > > >
> >> > > > > Thanks for the reply.
> >> > > > >
> >> > > > > Is the new method enough for doing size-based retention? It
> gives
> >> the
> >> > > > total
> >> > > > > size of the remote segments, but it seems that we still don't
> know
> >> > the
> >> > > > > exact total size for a log since there could be overlapping
> >> segments
> >> > > > > between the remote and the local segments.
> >> > > > >
> >> > > > > You mentioned a disk-less implementation. Do you imagine all
> >> accesses
> >> > > to
> >> > > > > remote metadata will be across the network or will there be some
> >> > local
> >> > > > > in-memory cache?
> >> > > > >
> >> > > > > Thanks,
> >> > > > >
> >> > > > > Jun
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Wed, Dec 7, 2022 at 3:10 AM Divij Vaidya <
> >> divijvaidya13@gmail.com
> >> > >
> >> > > > > wrote:
> >> > > > >
> >> > > > > > The method is needed for RLMM implementations which fetch the
> >> > > > information
> >> > > > > > over the network and not for the disk based implementations
> >> (such
> >> > as
> >> > > > the
> >> > > > > > default topic based RLMM).
> >> > > > > >
> >> > > > > > I would argue that adding this API makes the interface more
> >> generic
> >> > > > than
> >> > > > > > what it is today. This is because, with the current APIs an
> >> > > implementor
> >> > > > > is
> >> > > > > > restricted to use disk based RLMM solutions only (i.e. the
> >> default
> >> > > > > > solution) whereas if we add this new API, we unblock usage of
> >> > network
> >> > > > > based
> >> > > > > > RLMM implementations such as databases.
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > On Wed 30. Nov 2022 at 20:40, Jun Rao
> <jun@confluent.io.invalid
> >> >
> >> > > > wrote:
> >> > > > > >
> >> > > > > > > Hi, Divij,
> >> > > > > > >
> >> > > > > > > Thanks for the reply.
> >> > > > > > >
> >> > > > > > > Point#2. My high level question is that is the new method
> >> needed
> >> > > for
> >> > > > > > every
> >> > > > > > > implementation of remote storage or just for a specific
> >> > > > implementation.
> >> > > > > > The
> >> > > > > > > issues that you pointed out exist for the default
> >> implementation
> >> > of
> >> > > > > RLMM
> >> > > > > > as
> >> > > > > > > well and so far, the default implementation hasn't found a
> >> need
> >> > > for a
> >> > > > > > > similar new method. For public interface, ideally we want to
> >> make
> >> > > it
> >> > > > > more
> >> > > > > > > general.
> >> > > > > > >
> >> > > > > > > Thanks,
> >> > > > > > >
> >> > > > > > > Jun
> >> > > > > > >
> >> > > > > > > On Mon, Nov 21, 2022 at 7:11 AM Divij Vaidya <
> >> > > > divijvaidya13@gmail.com>
> >> > > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Thank you Jun and Alex for your comments.
> >> > > > > > > >
> >> > > > > > > > Point#1: You are right Jun. As Alex mentioned, the
> "derived
> >> > > > metadata"
> >> > > > > > can
> >> > > > > > > > increase the size of cached metadata by a factor of 10 but
> >> it
> >> > > > should
> >> > > > > be
> >> > > > > > > ok
> >> > > > > > > > to cache just the actual metadata. My point about size
> >> being a
> >> > > > > > limitation
> >> > > > > > > > for using cache is not valid anymore.
> >> > > > > > > >
> >> > > > > > > > Point#2: For a new replica, it would still have to fetch
> the
> >> > > > metadata
> >> > > > > > > over
> >> > > > > > > > the network to initiate the warm up of the cache and
> hence,
> >> > > > increase
> >> > > > > > the
> >> > > > > > > > start time of the archival process. Please also note the
> >> > > > > repercussions
> >> > > > > > of
> >> > > > > > > > the warm up scan that Alex mentioned in this thread as
> part
> >> of
> >> > > > > #102.2.
> >> > > > > > > >
> >> > > > > > > > 100#: Agreed Alex. Thanks for clarifying that. My point
> >> about
> >> > > size
> >> > > > > > being
> >> > > > > > > a
> >> > > > > > > > limitation for using cache is not valid anymore.
> >> > > > > > > >
> >> > > > > > > > 101#: Alex, if I understand correctly, you are suggesting
> to
> >> > > cache
> >> > > > > the
> >> > > > > > > > total size at the leader and update it on archival. This
> >> > wouldn't
> >> > > > > work
> >> > > > > > > for
> >> > > > > > > > cases when the leader restarts where we would have to
> make a
> >> > full
> >> > > > > scan
> >> > > > > > > > to update the total size entry on startup. We expect users
> >> to
> >> > > store
> >> > > > > > data
> >> > > > > > > > over longer duration in remote storage which increases the
> >> > > > likelihood
> >> > > > > > of
> >> > > > > > > > leader restarts / failovers.
> >> > > > > > > >
> >> > > > > > > > 102#.1: I don't think that the current design accommodates
> >> the
> >> > > fact
> >> > > > > > that
> >> > > > > > > > data corruption could happen at the RLMM plugin (we don't
> >> have
> >> > > > > checksum
> >> > > > > > > as
> >> > > > > > > > a field in metadata as part of KIP405). If data corruption
> >> > > occurs,
> >> > > > w/
> >> > > > > > or
> >> > > > > > > > w/o the cache, it would be a different problem to solve. I
> >> > would
> >> > > > like
> >> > > > > > to
> >> > > > > > > > keep this outside the scope of this KIP.
> >> > > > > > > >
> >> > > > > > > > 102#.2: Agree. This remains as the main concern for using
> >> the
> >> > > cache
> >> > > > > to
> >> > > > > > > > fetch total size.
> >> > > > > > > >
> >> > > > > > > > Regards,
> >> > > > > > > > Divij Vaidya
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > On Fri, Nov 18, 2022 at 12:59 PM Alexandre Dupriez <
> >> > > > > > > > alexandre.dupriez@gmail.com> wrote:
> >> > > > > > > >
> >> > > > > > > > > Hi Divij,
> >> > > > > > > > >
> >> > > > > > > > > Thanks for the KIP. Please find some comments based on
> >> what I
> >> > > > read
> >> > > > > on
> >> > > > > > > > > this thread so far - apologies for the repeats and the
> >> late
> >> > > > reply.
> >> > > > > > > > >
> >> > > > > > > > > If I understand correctly, one of the main elements of
> >> > > discussion
> >> > > > > is
> >> > > > > > > > > about caching in Kafka versus delegation of providing
> the
> >> > > remote
> >> > > > > size
> >> > > > > > > > > of a topic-partition to the plugin.
> >> > > > > > > > >
> >> > > > > > > > > A few comments:
> >> > > > > > > > >
> >> > > > > > > > > 100. The size of the “derived metadata” which is managed
> >> by
> >> > the
> >> > > > > > plugin
> >> > > > > > > > > to represent an rlmMetadata can indeed be close to 1 kB
> on
> >> > > > average
> >> > > > > > > > > depending on its own internal structure, e.g. the
> >> redundancy
> >> > it
> >> > > > > > > > > enforces (unfortunately resulting in duplication),
> >> additional
> >> > > > > > > > > information such as checksums and primary and secondary
> >> > > indexable
> >> > > > > > > > > keys. But indeed, the rlmMetadata is itself a lighter
> data
> >> > > > > structure
> >> > > > > > > > > by a factor of 10. And indeed, instead of caching the
> >> > “derived
> >> > > > > > > > > metadata”, only the rlmMetadata could be, which should
> >> > address
> >> > > > the
> >> > > > > > > > > concern regarding the memory occupancy of the cache.
> >> > > > > > > > >
> >> > > > > > > > > 101. I am not sure I fully understand why we would need
> to
> >> > > cache
> >> > > > > the
> >> > > > > > > > > list of rlmMetadata to retain the remote size of a
> >> > > > topic-partition.
> >> > > > > > > > > Since the leader of a topic-partition is, in
> >> non-degenerated
> >> > > > cases,
> >> > > > > > > > > the only actor which can mutate the remote part of the
> >> > > > > > > > > topic-partition, hence its size, it could in theory only
> >> > cache
> >> > > > the
> >> > > > > > > > > size of the remote log once it has calculated it? In
> which
> >> > case
> >> > > > > there
> >> > > > > > > > > would not be any problem regarding the size of the
> caching
> >> > > > > strategy.
> >> > > > > > > > > Did I miss something there?
> >> > > > > > > > >
> >> > > > > > > > > 102. There may be a few challenges to consider with
> >> caching:
> >> > > > > > > > >
> >> > > > > > > > > 102.1) As mentioned above, the caching strategy assumes
> no
> >> > > > mutation
> >> > > > > > > > > outside the lifetime of a leader. While this is true in
> >> the
> >> > > > normal
> >> > > > > > > > > course of operation, there could be accidental mutation
> >> > outside
> >> > > > of
> >> > > > > > the
> >> > > > > > > > > leader and a loss of consistency between the cached
> state
> >> and
> >> > > the
> >> > > > > > > > > actual remote representation of the log. E.g.
> split-brain
> >> > > > > scenarios,
> >> > > > > > > > > bugs in the plugins, bugs in external systems with
> >> mutating
> >> > > > access
> >> > > > > on
> >> > > > > > > > > the derived metadata. In the worst case, a drift between
> >> the
> >> > > > cached
> >> > > > > > > > > size and the actual size could lead to over-deleting
> >> remote
> >> > > data
> >> > > > > > which
> >> > > > > > > > > is a durability risk.
> >> > > > > > > > >
> >> > > > > > > > > The alternative you propose, by making the plugin the
> >> source
> >> > of
> >> > > > > truth
> >> > > > > > > > > w.r.t. the size of the remote log, can make it easier
> >> to
> >> > > avoid
> >> > > > > > > > > inconsistencies between plugin-managed metadata and the
> >> > remote
> >> > > > log
> >> > > > > > > > > from the perspective of Kafka. On the other hand, plugin
> >> > > vendors
> >> > > > > > would
> >> > > > > > > > > have to implement it with the expected efficiency to
> have
> >> it
> >> > > > yield
> >> > > > > > > > > benefits.
> >> > > > > > > > >
> >> > > > > > > > > 102.2) As you mentioned, the caching strategy in Kafka
> >> would
> >> > > > still
> >> > > > > > > > > require one iteration over the list of rlmMetadata when
> >> the
> >> > > > > > leadership
> >> > > > > > > > > of a topic-partition is assigned to a broker, while the
> >> > plugin
> >> > > > can
> >> > > > > > > > > offer alternative constant-time approaches. This
> >> calculation
> >> > > > cannot
> >> > > > > > be
> >> > > > > > > > > put on the LeaderAndIsr path and would be performed in
> the
> >> > > > > > background.
> >> > > > > > > > > In case of bulk leadership migration, listing the
> >> rlmMetadata
> >> > > > could
> >> > > > > > a)
> >> > > > > > > > > result in request bursts to any backend system the
> plugin
> >> may
> >> > > use
> >> > > > > > > > > [which shouldn’t be a problem for high-throughput data
> >> stores
> >> > > but
> >> > > > > > > > > could have cost implications] b) increase utilisation
> >> > timespan
> >> > > of
> >> > > > > the
> >> > > > > > > > > RLM threads for these calculations potentially leading
> to
> >> > > > transient
> >> > > > > > > > > starvation of tasks queued for, typically, offloading
> >> > > operations
> >> > > > c)
> >> > > > > > > > > could have a non-marginal CPU footprint on hardware with
> >> > strict
> >> > > > > > > > > resource constraints. All these elements could have an
> >> impact
> >> > > to
> >> > > > > some
> >> > > > > > > > > degree depending on the operational environment.
> >> > > > > > > > >
> >> > > > > > > > > From a design perspective, one question is where we want
> >> the
> >> > > > source
> >> > > > > > of
> >> > > > > > > > > truth w.r.t. remote log size to be during the lifetime
> of
> >> a
> >> > > > leader.
> >> > > > > > > > > The responsibility of maintaining a consistent
> >> representation
> >> > > of
> >> > > > > the
> >> > > > > > > > > remote log is shared by Kafka and the plugin. Which
> >> system is
> >> > > > best
> >> > > > > > > > > placed to maintain such a state while providing the
> >> highest
> >> > > > > > > > > consistency guarantees is something both Kafka and
> plugin
> >> > > > designers
> >> > > > > > > > > could help understand better.
> >> > > > > > > > >
> >> > > > > > > > > Many thanks,
> >> > > > > > > > > Alexandre
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > Le jeu. 17 nov. 2022 à 19:27, Jun Rao
> >> > <jun@confluent.io.invalid
> >> > > >
> >> > > > a
> >> > > > > > > > écrit :
> >> > > > > > > > > >
> >> > > > > > > > > > Hi, Divij,
> >> > > > > > > > > >
> >> > > > > > > > > > Thanks for the reply.
> >> > > > > > > > > >
> >> > > > > > > > > > Point #1. Is the average remote segment metadata
> really
> >> > 1KB?
> >> > > > > What's
> >> > > > > > > > > listed
> >> > > > > > > > > > in the public interface is probably well below 100
> >> bytes.
> >> > > > > > > > > >
> >> > > > > > > > > > Point #2. I guess you are assuming that each broker
> only
> >> > > caches
> >> > > > > the
> >> > > > > > > > > remote
> >> > > > > > > > > > segment metadata in memory. An alternative approach is
> >> to
> >> > > cache
> >> > > > > > them
> >> > > > > > > in
> >> > > > > > > > > > both memory and local disk. That way, on broker
> restart,
> >> > you
> >> > > > just
> >> > > > > > > need
> >> > > > > > > > to
> >> > > > > > > > > > fetch the new remote segments' metadata using the
> >> > > > > > > > > > listRemoteLogSegments(TopicIdPartition
> topicIdPartition,
> >> > int
> >> > > > > > > > leaderEpoch)
> >> > > > > > > > > > api. Will that work?
> >> > > > > > > > > >
> >> > > > > > > > > > Point #3. Thanks for the explanation and it sounds
> good.
> >> > > > > > > > > >
> >> > > > > > > > > > Thanks,
> >> > > > > > > > > >
> >> > > > > > > > > > Jun
> >> > > > > > > > > >
> >> > > > > > > > > > On Thu, Nov 17, 2022 at 7:31 AM Divij Vaidya <
> >> > > > > > > divijvaidya13@gmail.com>
> >> > > > > > > > > > wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > > Hi Jun
> >> > > > > > > > > > >
> >> > > > > > > > > > > There are three points that I would like to present
> >> here:
> >> > > > > > > > > > >
> >> > > > > > > > > > > 1. We would require a large cache size to
> efficiently
> >> > cache
> >> > > > all
> >> > > > > > > > segment
> >> > > > > > > > > > > metadata.
> >> > > > > > > > > > > 2. Linear scan of all metadata at broker startup to
> >> > > populate
> >> > > > > the
> >> > > > > > > > cache
> >> > > > > > > > > will
> >> > > > > > > > > > > be slow and will impact the archival process.
> >> > > > > > > > > > > 3. There is no other use case where a full scan of
> >> > segment
> >> > > > > > metadata
> >> > > > > > > > is
> >> > > > > > > > > > > required.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Let's start by quantifying 1. Here's my estimate for
> >> the
> >> > > size
> >> > > > > of
> >> > > > > > > the
> >> > > > > > > > > cache.
> >> > > > > > > > > > > Average size of segment metadata = 1KB. This could
> be
> >> > more
> >> > > if
> >> > > > > we
> >> > > > > > > have
> >> > > > > > > > > > > frequent leader failover with a large number of
> leader
> >> > > epochs
> >> > > > > > being
> >> > > > > > > > > stored
> >> > > > > > > > > > > per segment.
> >> > > > > > > > > > > Segment size = 100MB. Users will prefer to reduce
> the
> >> > > segment
> >> > > > > > size
> >> > > > > > > > > from the
> >> > > > > > > > > > > default value of 1GB to ensure timely archival of
> data
> >> > > since
> >> > > > > data
> >> > > > > > > > from
> >> > > > > > > > > > > active segment is not archived.
> >> > > > > > > > > > > Cache size = num segments * avg. segment metadata
> >> size =
> >> > > > > > > > > (100TB/100MB)*1KB
> >> > > > > > > > > > > = 1GB.
> >> > > > > > > > > > > While 1GB for cache may not sound like a large
> number
> >> for
> >> > > > > larger
> >> > > > > > > > > machines,
> >> > > > > > > > > > > it does eat into the memory as an additional cache
> and
> >> > > makes
> >> > > > > use
> >> > > > > > > > cases
> >> > > > > > > > > with
> >> > > > > > > > > > > large data retention with low throughput expensive
> >> (where
> >> > > > such
> >> > > > > > use
> >> > > > > > > > case
> >> > > > > > > > > > > could use smaller machines).
> >> > > > > > > > > > >
> >> > > > > > > > > > > About point#2:
> >> > > > > > > > > > > Even if we say that all segment metadata can fit
> into
> >> the
> >> > > > > cache,
> >> > > > > > we
> >> > > > > > > > > will
> >> > > > > > > > > > > need to populate the cache on broker startup. It
> would
> >> > not
> >> > > be
> >> > > > > in
> >> > > > > > > the
> >> > > > > > > > > > > critical path of broker startup and hence won't
> >> impact
> >> > the
> >> > > > > > startup
> >> > > > > > > > > time.
> >> > > > > > > > > > > But it will impact the time when we could start the
> >> > > archival
> >> > > > > > > process
> >> > > > > > > > > since
> >> > > > > > > > > > > the RLM thread pool will be blocked on the first
> call
> >> to
> >> > > > > > > > > > > listRemoteLogSegments(). To scan metadata for 1MM
> >> > segments
> >> > > > > > > (computed
> >> > > > > > > > > above)
> >> > > > > > > > > > > and transfer 1GB data over the network from a RLMM
> >> such
> >> > as
> >> > > a
> >> > > > > > remote
> >> > > > > > > > > > > database would be in the order of minutes (depending
> >> on
> >> > how
> >> > > > > > > efficient
> >> > > > > > > > > the
> >> > > > > > > > > > > scan is with the RLMM implementation). Although, I
> >> would
> >> > > > > concede
> >> > > > > > > that
> >> > > > > > > > > > > having RLM threads blocked for a few minutes is
> >> perhaps
> >> > OK
> >> > > > but
> >> > > > > if
> >> > > > > > > we
> >> > > > > > > > > > > introduce the new API proposed in the KIP, we would
> >> have
> >> > a
> >> > > > > > > > > > > deterministic startup time for RLM. Adding the API
> >> comes
> >> > > at a
> >> > > > > low
> >> > > > > > > > cost
> >> > > > > > > > > and
> >> > > > > > > > > > > I believe the trade off is worth it.
> >> > > > > > > > > > >
> >> > > > > > > > > > > About point#3:
> >> > > > > > > > > > > We can use listRemoteLogSegments(TopicIdPartition
> >> > > > > > topicIdPartition,
> >> > > > > > > > int
> >> > > > > > > > > > > leaderEpoch) to calculate the segments eligible for
> >> > > deletion
> >> > > > > > (based
> >> > > > > > > > on
> >> > > > > > > > > size
> >> > > > > > > > > > > retention) where leader epoch(s) belong to the
> current
> >> > > leader
> >> > > > > > epoch
> >> > > > > > > > > chain.
> >> > > > > > > > > > > I understand that it may lead to segments belonging
> to
> >> > > other
> >> > > > > > epoch
> >> > > > > > > > > lineage
> >> > > > > > > > > > > not getting deleted and would require a separate
> >> > mechanism
> >> > > to
> >> > > > > > > delete
> >> > > > > > > > > them.
> >> > > > > > > > > > > The separate mechanism would anyways be required to
> >> > delete
> >> > > > > these
> >> > > > > > > > > "leaked"
> >> > > > > > > > > > > segments as there are other cases which could lead
> to
> >> > leaks
> >> > > > > such
> >> > > > > > as
> >> > > > > > > > > network
> >> > > > > > > > > > > problems with RSM midway through writing a segment
> >> etc.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Thank you for the replies so far. They have made me
> >> > > re-think
> >> > > > my
> >> > > > > > > > > assumptions
> >> > > > > > > > > > > and this dialogue has been very constructive for me.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Regards,
> >> > > > > > > > > > > Divij Vaidya
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > On Thu, Nov 10, 2022 at 10:49 PM Jun Rao
> >> > > > > > <jun@confluent.io.invalid
> >> > > > > > > >
> >> > > > > > > > > wrote:
> >> > > > > > > > > > >
> >> > > > > > > > > > > > Hi, Divij,
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Thanks for the reply.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > It's true that the data in Kafka could be kept longer with KIP-405. How
> >> > > > > > > > > > > > much data do you envision to have per broker? For 100TB data per broker,
> >> > > > > > > > > > > > with 1GB segment and segment metadata of 100 bytes, it requires
> >> > > > > > > > > > > > 100TB/1GB*100 = 10MB, which should fit in memory.
> >> > > > > > > > > > > >
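
The estimate above can be checked with a few lines of arithmetic. This is
a minimal sketch that assumes decimal units (1 TB = 10^12 bytes, 1 GB =
10^9 bytes). The function name is made up for illustration, and the 100
bytes of metadata per segment is the figure assumed in the message, not a
measured value.

```python
def metadata_footprint_bytes(data_bytes, segment_bytes, metadata_bytes_per_segment):
    """Estimate the memory needed to hold metadata for every segment."""
    # Number of segments, rounding up for a partially filled last segment.
    segments = -(-data_bytes // segment_bytes)
    return segments * metadata_bytes_per_segment


TB = 10**12  # decimal terabyte (assumption)
GB = 10**9   # decimal gigabyte (assumption)

# 100 TB per broker, 1 GB segments, 100 bytes of metadata per segment.
footprint = metadata_footprint_bytes(100 * TB, 1 * GB, 100)
print(footprint / 10**6, "MB")
```

With these inputs there are 100,000 segments, so the footprint comes to 10
MB. The result scales linearly with the data volume and inversely with the
segment size, so smaller segments or richer metadata raise it accordingly.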
> >> > > > > > > > > > > > RemoteLogMetadataManager has two listRemoteLogSegments() methods. The one
> >> > > > > > > > > > > > you listed listRemoteLogSegments(TopicIdPartition topicIdPartition, int
> >> > > > > > > > > > > > leaderEpoch) does return data in offset order. However, the other
> >> > > > > > > > > > > > one listRemoteLogSegments(TopicIdPartition topicIdPartition) doesn't
> >> > > > > > > > > > > > specify the return order. I assume that you need the latter to calculate
> >> > > > > > > > > > > > the segment size?
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Thanks,
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Jun
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > On Thu, Nov 10, 2022 at 10:25 AM Divij Vaidya <
> >> > > > > > > > > divijvaidya13@gmail.com>
> >> > > > > > > > > > > > wrote:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > > *Jun,*
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > *"the default implementation of RLMM does local caching, right?"*
> >> > > > > > > > > > > > > Yes, Jun. The default implementation of RLMM does indeed cache the
> >> > > > > > > > > > > > > segment metadata today, hence, it won't work for use cases when the
> >> > > > > > > > > > > > > number of segments in remote storage is large enough to exceed the size
> >> > > > > > > > > > > > > of cache. As part of this KIP, I will implement the new proposed API in
> >> > > > > > > > > > > > > the default implementation of RLMM but the underlying implementation will
> >> > > > > > > > > > > > > still be a scan. I will pick up optimizing that in a separate PR.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > *"we also cache all segment metadata in the brokers without KIP-405. Do
> >> > > > > > > > > > > > > you see a need to change that?"*
> >> > > > > > > > > > > > > Please correct me if I am wrong here but we cache metadata for segments
> >> > > > > > > > > > > > > "residing in local storage". The size of the current cache works fine for
> >> > > > > > > > > > > > > the scale of the number of segments that we expect to store in local
> >> > > > > > > > > > > > > storage. After KIP-405, that cache will continue to store metadata for
> >> > > > > > > > > > > > > segments which are residing in local storage and hence, we don't need to
> >> > > > > > > > > > > > > change that. For segments which have been offloaded to remote storage, it
> >> > > > > > > > > > > > > would rely on RLMM. Note that the scale of data stored in RLMM is
> >> > > > > > > > > > > > > different from local cache because the number of segments is expected to
> >> > > > > > > > > > > > > be much larger than what current implementation stores in local storage.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > 2,3,4: RemoteLogMetadataManager.listRemoteLogSegments() does specify the
> >> > > > > > > > > > > > > order i.e. it returns the segments sorted by first offset in ascending
> >> > > > > > > > > > > > > order. I am copying the API docs for KIP-405 here for your reference
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > *Returns iterator of remote log segment metadata, sorted by
> >> > > > > > > > > > > > > {@link RemoteLogSegmentMetadata#startOffset()} in ascending order
> >> > > > > > > > > > > > > which contains the given leader epoch. This is used by remote log
> >> > > > > > > > > > > > > retention management subsystem to fetch the segment metadata for a
> >> > > > > > > > > > > > > given leader epoch.
> >> > > > > > > > > > > > > @param topicIdPartition topic partition
> >> > > > > > > > > > > > > @param leaderEpoch leader epoch
> >> > > > > > > > > > > > > @return Iterator of remote segments, sorted by start offset in
> >> > > > > > > > > > > > > ascending order.*
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > *Luke,*
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > 5. Note that we are trying to optimize the efficiency of size based
> >> > > > > > > > > > > > > retention for remote storage. KIP-405 does not introduce a new config for
> >> > > > > > > > > > > > > periodically checking remote similar to log.retention.check.interval.ms
> >> > > > > > > > > > > > > which is applicable for remote storage. Hence, the metric will be updated
> >> > > > > > > > > > > > > at the time of invoking log retention check for remote tier which is
> >> > > > > > > > > > > > > pending implementation today. We can perhaps come back and update the
> >> > > > > > > > > > > > > metric description after the implementation of log retention check in
> >> > > > > > > > > > > > > RemoteLogManager.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > --
> >> > > > > > > > > > > > > Divij Vaidya
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > On Thu, Nov 10, 2022 at 6:16 AM Luke Chen <
> >> > > > > showuon@gmail.com
> >> > > > > > >
> >> > > > > > > > > wrote:
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Hi Divij,
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > One more question about the metric:
> >> > > > > > > > > > > > > > I think the metric will be updated
> >> > > > > > > > > > > > > > (1) each time we run the log retention check (that is,
> >> > > > > > > > > > > > > > log.retention.check.interval.ms)
> >> > > > > > > > > > > > > > (2) when a user explicitly calls getRemoteLogSize
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Is that correct?
> >> > > > > > > > > > > > > > Maybe we should add a note in the metric description; otherwise, a user
> >> > > > > > > > > > > > > > who gets, let's say, 0 for RemoteLogSizeBytes will be surprised.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Otherwise, LGTM
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Thank you for the KIP
> >> > > > > > > > > > > > > > Luke
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > On Thu, Nov 10, 2022 at 2:55 AM Jun Rao
> >> > > > > > > > <jun@confluent.io.invalid
> >> > > > > > > > > >
> >> > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Hi, Divij,
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Thanks for the explanation.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > 1. Hmm, the default implementation of RLMM does local caching, right?
> >> > > > > > > > > > > > > > > Currently, we also cache all segment metadata in the brokers without
> >> > > > > > > > > > > > > > > KIP-405. Do you see a need to change that?
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > 2,3,4: Yes, your explanation makes sense. However, currently,
> >> > > > > > > > > > > > > > > RemoteLogMetadataManager.listRemoteLogSegments() doesn't specify
> >> > > > > > > > > > > > > > > a particular order of the iterator. Do you intend to change that?
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Thanks,
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Jun
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 3:31 AM Divij Vaidya
> <
> >> > > > > > > > > > > divijvaidya13@gmail.com
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > Hey Jun
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > Thank you for your comments.
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > *1. "RLMM implementor could ensure that listRemoteLogSegments() is
> >> > > > > > > > > > > > > > > > fast"*
> >> > > > > > > > > > > > > > > > This would be ideal but pragmatically, it is difficult to ensure that
> >> > > > > > > > > > > > > > > > listRemoteLogSegments() is fast. This is because the possibility of a
> >> > > > > > > > > > > > > > > > large number of segments (much larger than what Kafka currently handles
> >> > > > > > > > > > > > > > > > with local storage today) would make it infeasible to adopt strategies
> >> > > > > > > > > > > > > > > > such as local caching to improve the performance of
> >> > > > > > > > > > > > > > > > listRemoteLogSegments. Apart from caching (which won't work due to size
> >> > > > > > > > > > > > > > > > limitations) I can't think of other strategies which may eliminate the
> >> > > > > > > > > > > > > > > > need for IO operations proportional to the number of total segments.
> >> > > > > > > > > > > > > > > > Please advise if you have something in mind.
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > 2. *"If the size exceeds the retention size, we need to determine the
> >> > > > > > > > > > > > > > > > subset of segments to delete to bring the size within the retention
> >> > > > > > > > > > > > > > > > limit. Do we need to call
> >> > > > > > > > > > > > > > > > RemoteLogMetadataManager.listRemoteLogSegments() to determine that?"*
> >> > > > > > > > > > > > > > > > Yes, we need to call listRemoteLogSegments() to determine which segments
> >> > > > > > > > > > > > > > > > should be deleted. But there is a difference with the use case we are
> >> > > > > > > > > > > > > > > > trying to optimize with this KIP. To determine the subset of segments
> >> > > > > > > > > > > > > > > > which would be deleted, we only read metadata for segments which would be
> >> > > > > > > > > > > > > > > > deleted via the listRemoteLogSegments(). But to determine the
> >> > > > > > > > > > > > > > > > totalLogSize, which is required every time retention logic based on size
> >> > > > > > > > > > > > > > > > executes, we read metadata of *all* the segments in remote storage.
> >> > > > > > > > > > > > > > > > Hence, the number of results returned by
> >> > > > > > > > > > > > > > > > *RemoteLogMetadataManager.listRemoteLogSegments()* is different when we
> >> > > > > > > > > > > > > > > > are calculating totalLogSize vs. when we are determining the subset of
> >> > > > > > > > > > > > > > > > segments to delete.
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > 3. *"Also, what about time-based retention? To make that efficient, do
> >> > > > > > > > > > > > > > > > we need to make some additional interface changes?"*
> >> > > > > > > > > > > > > > > > No. Note that time complexity to determine the segments for retention is
> >> > > > > > > > > > > > > > > > different for time based vs. size based. For time based, the time
> >> > > > > > > > > > > > > > > > complexity is a function of the number of segments which are "eligible
> >> > > > > > > > > > > > > > > > for deletion" (since we only read metadata for segments which would be
> >> > > > > > > > > > > > > > > > deleted) whereas in size based retention, the time complexity is a
> >> > > > > > > > > > > > > > > > function of "all segments" available in remote storage (metadata of all
> >> > > > > > > > > > > > > > > > segments needs to be read to calculate the total size). As you may
> >> > > > > > > > > > > > > > > > observe, this KIP will bring the time complexity for both time based
> >> > > > > > > > > > > > > > > > retention & size based retention to the same function.
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > 4. Also, please note that this new API introduced in this KIP also
> >> > > > > > > > > > > > > > > > enables us to provide a metric for total size of data stored in remote
> >> > > > > > > > > > > > > > > > storage. Without the API, calculation of this metric will become very
> >> > > > > > > > > > > > > > > > expensive with *listRemoteLogSegments().*
> >> > > > > > > > > > > > > > > > I understand that your motivation here is to avoid polluting the
> >> > > > > > > > > > > > > > > > interface with optimization specific APIs and I will agree with that
> >> > > > > > > > > > > > > > > > goal. But I believe that this new API proposed in the KIP brings in
> >> > > > > > > > > > > > > > > > significant improvement and there is no other work around available to
> >> > > > > > > > > > > > > > > > achieve the same performance.
> >> > > > > > > > > > > > > > > >
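
The difference described in points 2 and 3 above can be illustrated with a
small model. This is only a sketch under stated assumptions: the class and
function names are hypothetical and are not the Kafka API, the store
stands in for an RLMM whose listing is ordered by start offset, and the
deletion rule (drop oldest segments until the size fits the limit) is a
simplification of the real retention logic.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class SegmentMetadata:
    start_offset: int
    size_bytes: int


class CountingStore:
    """Hypothetical metadata store that counts how many entries are read."""

    def __init__(self, segments: Iterable[SegmentMetadata]):
        self._segments = sorted(segments, key=lambda s: s.start_offset)
        self.reads = 0

    def list_segments(self) -> Iterator[SegmentMetadata]:
        # Lazily yields metadata in ascending start-offset order.
        for segment in self._segments:
            self.reads += 1
            yield segment


def total_size_by_scan(store: CountingStore) -> int:
    # Reads metadata for every segment: cost grows with all segments.
    return sum(segment.size_bytes for segment in store.list_segments())


def segments_to_delete(store: CountingStore, total_size: int,
                       retention_bytes: int) -> List[SegmentMetadata]:
    # Reads only as many entries as will be deleted, oldest first.
    doomed: List[SegmentMetadata] = []
    remaining = total_size
    iterator = store.list_segments()
    while remaining > retention_bytes:
        segment = next(iterator, None)
        if segment is None:
            break
        doomed.append(segment)
        remaining -= segment.size_bytes
    return doomed


store = CountingStore(SegmentMetadata(i * 10, 100) for i in range(1000))
total = total_size_by_scan(store)
print("reads to compute total size:", store.reads)

store.reads = 0
doomed = segments_to_delete(store, total, retention_bytes=99_700)
print("segments deleted:", len(doomed), "reads:", store.reads)
```

In this model the scan touches all 1000 entries, while choosing the
segments to delete touches only the entries that are removed. If the total
size were available without a scan, both steps would cost the same order of
reads, which is the point made in the message.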
> >> > > > > > > > > > > > > > > > Regards,
> >> > > > > > > > > > > > > > > > Divij Vaidya
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 12:12 AM Jun Rao
> >> > > > > > > > > <jun@confluent.io.invalid
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > Hi, Divij,
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > Thanks for the KIP. Sorry for the late
> >> reply.
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > The motivation of the KIP is to improve the efficiency of size based
> >> > > > > > > > > > > > > > > > > retention. I am not sure the proposed changes are enough. For example, if
> >> > > > > > > > > > > > > > > > > the size exceeds the retention size, we need to determine the subset of
> >> > > > > > > > > > > > > > > > > segments to delete to bring the size within the retention limit. Do we
> >> > > > > > > > > > > > > > > > > need to call RemoteLogMetadataManager.listRemoteLogSegments() to
> >> > > > > > > > > > > > > > > > > determine that? Also, what about time-based retention? To make that
> >> > > > > > > > > > > > > > > > > efficient, do we need to make some additional interface changes?
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > An alternative approach is for the RLMM implementor to make sure
> >> > > > > > > > > > > > > > > > > that RemoteLogMetadataManager.listRemoteLogSegments() is fast (e.g.,
> >> > > > > > > > > > > > > > > > > with local caching). This way, we could keep the interface simple. Have
> >> > > > > > > > > > > > > > > > > we considered that?
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > Thanks,
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > Jun
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > On Wed, Sep 28, 2022 at 6:28 AM Divij
> >> Vaidya
> >> > <
> >> > > > > > > > > > > > > > divijvaidya13@gmail.com>
> >> > > > > > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > Hey folks
> >> > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > Does anyone else have any thoughts on this before I propose this for a
> >> > > > > > > > > > > > > > > > > > vote?
> >> > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > --
> >> > > > > > > > > > > > > > > > > > Divij Vaidya
> >> > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > On Mon, Sep 5, 2022 at 12:57 PM Satish
> >> > > Duggana
> >> > > > <
> >> > > > > > > > > > > > > > > > satish.duggana@gmail.com
> >> > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > Thanks for the KIP Divij!
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > This is a nice improvement to avoid recalculation of size. Customized
> >> > > > > > > > > > > > > > > > > > > RLMMs can implement the best possible approach by caching or maintaining
> >> > > > > > > > > > > > > > > > > > > the size in an efficient way. But this is not a big concern for the
> >> > > > > > > > > > > > > > > > > > > default topic based RLMM as mentioned in the KIP.
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > ~Satish.
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > On Wed, 13 Jul 2022 at 18:48, Divij
> >> > Vaidya
> >> > > <
> >> > > > > > > > > > > > > > > divijvaidya13@gmail.com>
> >> > > > > > > > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > Thank you for your review Luke.
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > Reg: would the new `RemoteLogSizeBytes` metric be a performance
> >> > > > > > > > > > > > > > > > > > > > > overhead? Although we move the calculation to a separate API, we
> >> > > > > > > > > > > > > > > > > > > > > still can't assume users will implement a light-weight method, right?
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > This metric would be logged using the information that is already being
> >> > > > > > > > > > > > > > > > > > > > calculated for handling remote retention logic, hence, no additional work
> >> > > > > > > > > > > > > > > > > > > > is required to calculate this metric. More specifically, whenever
> >> > > > > > > > > > > > > > > > > > > > RemoteLogManager calls getRemoteLogSize API, this metric would be
> >> > > > > > > > > > > > > > > > > > > > captured. This API call is made every time RemoteLogManager wants to
> >> > > > > > > > > > > > > > > > > > > > handle expired remote log segments (which should be periodic). Does that
> >> > > > > > > > > > > > > > > > > > > > address your concern?
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > Divij Vaidya
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > On Tue, Jul 12, 2022 at 11:01 AM
> >> Luke
> >> > > Chen
> >> > > > <
> >> > > > > > > > > > > > > showuon@gmail.com>
> >> > > > > > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > Hi Divij,
> >> > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > Thanks for the KIP!
> >> > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > I think it makes sense to delegate the responsibility of calculation to
> >> > > > > > > > > > > > > > > > > > > > > the specific RemoteLogMetadataManager implementation.
> >> > > > > > > > > > > > > > > > > > > > > But one thing I'm not quite sure about: would the new
> >> > > > > > > > > > > > > > > > > > > > > `RemoteLogSizeBytes` metric be a performance overhead?
> >> > > > > > > > > > > > > > > > > > > > > Although we move the calculation to a separate API, we still can't
> >> > > > > > > > > > > > > > > > > > > > > assume users will implement a light-weight method, right?
> >> > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > Thank you.
> >> > > > > > > > > > > > > > > > > > > > > Luke
> >> > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > On Fri, Jul 1, 2022 at 5:47 PM
> >> Divij
> >> > > > > Vaidya <
> >> > > > > > > > > > > > > > > > > divijvaidya13@gmail.com
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-852%3A+Optimize+calculation+of+size+for+log+in+remote+tier
> >> > > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > > Hey folks
> >> > > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > > Please take a look at this KIP which proposes an extension to KIP-405.
> >> > > > > > > > > > > > > > > > > > > > > > This is my first KIP with Apache Kafka community so any feedback would
> >> > > > > > > > > > > > > > > > > > > > > > be highly appreciated.
> >> > > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > > Cheers!
> >> > > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > > --
> >> > > > > > > > > > > > > > > > > > > > > > Divij Vaidya
> >> > > > > > > > > > > > > > > > > > > > > > Sr. Software Engineer
> >> > > > > > > > > > > > > > > > > > > > > > Amazon
> >> > > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>

Re: [DISCUSS] KIP-852 Optimize calculation of size for log in remote tier

Posted by Divij Vaidya <di...@gmail.com>.
Satish / Jun

Do you have any thoughts on this?

--
Divij Vaidya



On Tue, Feb 14, 2023 at 4:15 PM Divij Vaidya <di...@gmail.com>
wrote:

> Hey Jun
>
> It has been a while since this KIP got some attention. While we wait for
> Satish to chime in here, perhaps I can answer your question.
>
> > Could you explain how you exposed the log size in your KIP-405
> implementation?
>
> The APIs available in RLMM as per KIP-405 are
> addRemoteLogSegmentMetadata(), updateRemoteLogSegmentMetadata(),
> remoteLogSegmentMetadata(), highestOffsetForEpoch(),
> putRemotePartitionDeleteMetadata(), listRemoteLogSegments(),
> onPartitionLeadershipChanges() and onStopPartitions(). None of these
> APIs allow us to expose the log size,
> hence, the only option that remains is to list all segments using
> listRemoteLogSegments() and aggregate them every time we require to
> calculate the size. Based on our prior discussion, this requires reading
> all segment metadata which won't work for non-local RLMM implementations.
> Satish's implementation also performs a full scan and calculates the
> aggregate. see:
> https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/remote/RemoteLogManager.scala#L619
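
To make the cost of the list-and-aggregate option concrete, here is a minimal Python sketch. It is not Kafka code: SegmentMetadata and list_remote_log_segments are hypothetical stand-ins for RemoteLogSegmentMetadata and listRemoteLogSegments(), and the sizes are made up.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class SegmentMetadata:
    """Hypothetical stand-in for RemoteLogSegmentMetadata."""
    start_offset: int
    size_in_bytes: int


def list_remote_log_segments(store: Iterable[SegmentMetadata]) -> Iterator[SegmentMetadata]:
    """Stand-in for listRemoteLogSegments(): yields every segment's metadata."""
    yield from store


def remote_log_size_by_full_scan(store: Iterable[SegmentMetadata]) -> int:
    # Work is proportional to the number of segments on every call; with a
    # network-backed RLMM each metadata record also crosses the network.
    return sum(s.size_in_bytes for s in list_remote_log_segments(store))


segments = [SegmentMetadata(0, 100), SegmentMetadata(50, 200), SegmentMetadata(120, 300)]
total = remote_log_size_by_full_scan(segments)
```

Every size query repeats the whole iteration, which is the behaviour the proposed API is meant to let plugins avoid.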
>
>
> Does this answer your question?
>
> --
> Divij Vaidya
>
>
>
> On Tue, Dec 20, 2022 at 8:40 PM Jun Rao <ju...@confluent.io.invalid> wrote:
>
>> Hi, Divij,
>>
>> Thanks for the explanation.
>>
>> Good question.
>>
>> Hi, Satish,
>>
>> Could you explain how you exposed the log size in your KIP-405
>> implementation?
>>
>> Thanks,
>>
>> Jun
>>
>> On Tue, Dec 20, 2022 at 4:59 AM Divij Vaidya <di...@gmail.com>
>> wrote:
>>
>> > Hey Jun
>> >
>> > Yes, it is possible to maintain the log size in the cache (see rejected
>> > alternative #3 in the KIP), but I did not understand how it is possible
>> > to retrieve it without the new API. The log size could be calculated on
>> > startup by scanning through the segments (though I would disagree that
>> > this is the right approach, since the scan itself takes on the order of
>> > minutes and hence delays the start of the archive process) and
>> > incrementally maintained afterwards. Even then, we would need an API in
>> > RemoteLogMetadataManager so that RLM could fetch the cached size!
>> >
>> > If we wish to cache the size without adding a new API, then we need to
>> > cache the size in RLM itself (instead of the RLMM implementation) and
>> > incrementally manage it. The downside of longer archive time at startup
>> > (due to the initial scan) still remains valid in this situation.
>> >
>> > --
>> > Divij Vaidya
>> >
>> >
>> >
>> > On Fri, Dec 16, 2022 at 12:43 AM Jun Rao <ju...@confluent.io.invalid>
>> > wrote:
>> >
>> > > Hi, Divij,
>> > >
>> > > Thanks for the explanation.
>> > >
>> > > If there is an in-memory cache, could we maintain the log size in the
>> > > cache with the existing API? For example, a replica could make a
>> > > listRemoteLogSegments(TopicIdPartition topicIdPartition) call on
>> > > startup to get the remote segment size before the current leaderEpoch.
>> > > The leader could then maintain the size incrementally afterwards. On
>> > > leader change, other replicas can make a
>> > > listRemoteLogSegments(TopicIdPartition topicIdPartition, int
>> > > leaderEpoch) call to get the size of newly generated segments.
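
A rough Python sketch of the scheme described above: a startup scan, incremental updates on the leader, and a per-epoch catch-up on leader change. Everything here is hypothetical (ToyRlmm, CachedRemoteLogSize, the sizes), and it assumes each segment belongs to exactly one leader epoch, which real segments need not.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SegmentMetadata:
    leader_epoch: int   # simplification: a single epoch per segment
    size_in_bytes: int


class ToyRlmm:
    """Toy metadata store; the two list methods mimic the two
    listRemoteLogSegments() overloads."""

    def __init__(self) -> None:
        self._segments: List[SegmentMetadata] = []

    def add(self, segment: SegmentMetadata) -> None:
        self._segments.append(segment)

    def list_all(self) -> List[SegmentMetadata]:
        return list(self._segments)

    def list_for_epoch(self, leader_epoch: int) -> List[SegmentMetadata]:
        return [s for s in self._segments if s.leader_epoch == leader_epoch]


class CachedRemoteLogSize:
    """Size cache kept by one replica."""

    def __init__(self, rlmm: ToyRlmm, current_epoch: int) -> None:
        self._rlmm = rlmm
        # Startup scan: count segments from before the current leader epoch.
        self.size_bytes = sum(
            s.size_in_bytes for s in rlmm.list_all() if s.leader_epoch < current_epoch
        )
        self._next_epoch = current_epoch  # first epoch not yet counted

    def on_segment_copied(self, segment: SegmentMetadata) -> None:
        # Leader path: maintain the size incrementally.
        self.size_bytes += segment.size_in_bytes

    def on_leader_change(self, new_epoch: int) -> None:
        # Follower path: add segments written under epochs this replica
        # has not counted yet (it was not the leader during them).
        for epoch in range(self._next_epoch, new_epoch):
            self.size_bytes += sum(
                s.size_in_bytes for s in self._rlmm.list_for_epoch(epoch)
            )
        self._next_epoch = new_epoch


rlmm = ToyRlmm()
rlmm.add(SegmentMetadata(0, 100))
rlmm.add(SegmentMetadata(0, 200))
replica = CachedRemoteLogSize(rlmm, current_epoch=1)  # startup scan
rlmm.add(SegmentMetadata(1, 50))                      # uploaded by the epoch-1 leader
rlmm.add(SegmentMetadata(1, 70))
replica.on_leader_change(new_epoch=2)                 # catch up on epoch 1
new_segment = SegmentMetadata(2, 30)
rlmm.add(new_segment)
replica.on_segment_copied(new_segment)                # now leader, incremental update
full_scan = sum(s.size_in_bytes for s in rlmm.list_all())
```

The startup scan is still a full iteration over the metadata, which is the cost discussed in the replies above.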
>> > >
>> > > Thanks,
>> > >
>> > > Jun
>> > >
>> > >
>> > > On Wed, Dec 14, 2022 at 3:27 AM Divij Vaidya <divijvaidya13@gmail.com>
>> > > wrote:
>> > >
>> > > > > Is the new method enough for doing size-based retention?
>> > > >
>> > > > Yes. You are right in assuming that this API only provides the remote
>> > > > storage size (for the current epoch chain). We would use this API for
>> > > > size-based retention along with a value of localOnlyLogSegmentSize,
>> > > > which is computed as
>> > > > Log.sizeInBytes(logSegments.filter(_.baseOffset >
>> > > > highestOffsetWithRemoteIndex)). Hence, total_log_size =
>> > > > remoteLogSizeBytes + log.localOnlyLogSegmentSize. I have updated the
>> > > > KIP with this information. You can also check an example
>> > > > implementation at
>> > > > https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/Log.scala#L2077
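
The total_log_size formula above can be sketched in a few lines of Python. This is only an illustration with made-up numbers; LocalSegment and the function names are hypothetical and mirror the Scala expression quoted in the message, not the linked implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LocalSegment:
    base_offset: int
    size_in_bytes: int


def local_only_log_segment_size(local_segments: List[LocalSegment],
                                highest_offset_with_remote_index: int) -> int:
    # Count only local segments that start after the highest offset already
    # copied to remote storage, so overlapping data is not counted twice.
    return sum(
        s.size_in_bytes
        for s in local_segments
        if s.base_offset > highest_offset_with_remote_index
    )


def total_log_size(remote_log_size_bytes: int,
                   local_segments: List[LocalSegment],
                   highest_offset_with_remote_index: int) -> int:
    return remote_log_size_bytes + local_only_log_segment_size(
        local_segments, highest_offset_with_remote_index
    )


# Offsets 0..199 are already in remote storage (200 bytes in total there),
# but the two oldest local segments have not been cleaned up yet.
local = [LocalSegment(0, 100), LocalSegment(100, 100),
         LocalSegment(200, 100), LocalSegment(300, 40)]
local_only = local_only_log_segment_size(local, highest_offset_with_remote_index=199)
total = total_log_size(200, local, highest_offset_with_remote_index=199)
```

Summing all four local segments plus the remote size would give 540 here; the filter brings it down to 340 by skipping the overlap.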
>> > > >
>> > > >
>> > > > > Do you imagine all accesses to remote metadata will be across the
>> > > > > network or will there be some local in-memory cache?
>> > > >
>> > > > I would expect a disk-less implementation to maintain a finite
>> > > > in-memory cache for segment metadata to optimize the number of network
>> > > > calls made to fetch the data. In future, we can think about bringing
>> > > > this finite-size cache into RLM itself, but that's probably a
>> > > > conversation for a different KIP. There are many other things we would
>> > > > like to do to optimize the tiered storage interface, such as
>> > > > introducing a circular buffer / streaming interface from RSM (so that
>> > > > we don't have to wait to fetch the entire segment before starting to
>> > > > send records to the consumer), caching the segments fetched from RSM
>> > > > locally (I would assume all RSM plugin implementations do this, so we
>> > > > might as well add it to RLM), etc.
>> > > >
>> > > > --
>> > > > Divij Vaidya
>> > > >
>> > > >
>> > > >
>> > > > On Mon, Dec 12, 2022 at 7:35 PM Jun Rao <ju...@confluent.io.invalid>
>> > > > wrote:
>> > > >
>> > > > > Hi, Divij,
>> > > > >
>> > > > > Thanks for the reply.
>> > > > >
>> > > > > Is the new method enough for doing size-based retention? It gives the
>> > > > > total size of the remote segments, but it seems that we still don't
>> > > > > know the exact total size for a log since there could be overlapping
>> > > > > segments between the remote and the local segments.
>> > > > >
>> > > > > You mentioned a disk-less implementation. Do you imagine all accesses
>> > > > > to remote metadata will be across the network or will there be some
>> > > > > local in-memory cache?
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > Jun
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Wed, Dec 7, 2022 at 3:10 AM Divij Vaidya <divijvaidya13@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > > The method is needed for RLMM implementations which fetch the
>> > > > > > information over the network and not for the disk-based
>> > > > > > implementations (such as the default topic-based RLMM).
>> > > > > >
>> > > > > > I would argue that adding this API makes the interface more generic
>> > > > > > than what it is today. This is because, with the current APIs, an
>> > > > > > implementor is restricted to using disk-based RLMM solutions only
>> > > > > > (i.e. the default solution), whereas if we add this new API, we
>> > > > > > unblock usage of network-based RLMM implementations such as
>> > > > > > databases.
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > On Wed 30. Nov 2022 at 20:40, Jun Rao <jun@confluent.io.invalid>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Hi, Divij,
>> > > > > > >
>> > > > > > > Thanks for the reply.
>> > > > > > >
>> > > > > > > Point#2. My high-level question is whether the new method is needed
>> > > > > > > for every implementation of remote storage or just for a specific
>> > > > > > > implementation. The issues that you pointed out exist for the
>> > > > > > > default implementation of RLMM as well and so far, the default
>> > > > > > > implementation hasn't found a need for a similar new method. For a
>> > > > > > > public interface, ideally we want to make it more general.
>> > > > > > >
>> > > > > > > Thanks,
>> > > > > > >
>> > > > > > > Jun
>> > > > > > >
>> > > > > > > On Mon, Nov 21, 2022 at 7:11 AM Divij Vaidya <divijvaidya13@gmail.com>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > Thank you Jun and Alex for your comments.
>> > > > > > > >
>> > > > > > > > Point#1: You are right, Jun. As Alex mentioned, the "derived
>> > > > > > > > metadata" can increase the size of cached metadata by a factor of
>> > > > > > > > 10, but it should be ok to cache just the actual metadata. My
>> > > > > > > > point about size being a limitation for using a cache is not valid
>> > > > > > > > anymore.
>> > > > > > > >
>> > > > > > > > Point#2: A new replica would still have to fetch the metadata
>> > > > > > > > over the network to initiate the warm-up of the cache and hence
>> > > > > > > > increase the start time of the archival process. Please also note
>> > > > > > > > the repercussions of the warm-up scan that Alex mentioned in this
>> > > > > > > > thread as part of #102.2.
>> > > > > > > >
>> > > > > > > > 100#: Agreed, Alex. Thanks for clarifying that. My point about
>> > > > > > > > size being a limitation for using a cache is not valid anymore.
>> > > > > > > >
>> > > > > > > > 101#: Alex, if I understand correctly, you are suggesting that we
>> > > > > > > > cache the total size at the leader and update it on archival. This
>> > > > > > > > wouldn't work when the leader restarts, since we would have to
>> > > > > > > > make a full scan to update the total size entry on startup. We
>> > > > > > > > expect users to store data over longer durations in remote
>> > > > > > > > storage, which increases the likelihood of leader restarts /
>> > > > > > > > failovers.
>> > > > > > > >
>> > > > > > > > 102#.1: I don't think that the current design accommodates the
>> > > > > > > > fact that data corruption could happen at the RLMM plugin (we
>> > > > > > > > don't have a checksum as a field in the metadata as part of
>> > > > > > > > KIP-405). If data corruption occurs, w/ or w/o the cache, it would
>> > > > > > > > be a different problem to solve. I would like to keep this outside
>> > > > > > > > the scope of this KIP.
>> > > > > > > >
>> > > > > > > > 102#.2: Agree. This remains the main concern for using the cache
>> > > > > > > > to fetch the total size.
>> > > > > > > >
>> > > > > > > > Regards,
>> > > > > > > > Divij Vaidya
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Fri, Nov 18, 2022 at 12:59 PM Alexandre Dupriez <
>> > > > > > > > alexandre.dupriez@gmail.com> wrote:
>> > > > > > > >
>> > > > > > > > > Hi Divij,
>> > > > > > > > >
>> > > > > > > > > Thanks for the KIP. Please find some comments based on what I read
>> > > > > > > > > on this thread so far - apologies for the repeats and the late
>> > > > > > > > > reply.
>> > > > > > > > >
>> > > > > > > > > If I understand correctly, one of the main elements of discussion
>> > > > > > > > > is about caching in Kafka versus delegation of providing the
>> > > > > > > > > remote size of a topic-partition to the plugin.
>> > > > > > > > >
>> > > > > > > > > A few comments:
>> > > > > > > > >
>> > > > > > > > > 100. The size of the “derived metadata” which is managed by the
>> > > > > > > > > plugin to represent an rlmMetadata can indeed be close to 1 kB on
>> > > > > > > > > average depending on its own internal structure, e.g. the
>> > > > > > > > > redundancy it enforces (unfortunately resulting in duplication),
>> > > > > > > > > additional information such as checksums and primary and
>> > > > > > > > > secondary indexable keys. But indeed, the rlmMetadata is itself a
>> > > > > > > > > lighter data structure by a factor of 10. And indeed, instead of
>> > > > > > > > > caching the “derived metadata”, only the rlmMetadata could be,
>> > > > > > > > > which should address the concern regarding the memory occupancy
>> > > > > > > > > of the cache.
>> > > > > > > > >
>> > > > > > > > > 101. I am not sure I fully understand why we would need to cache
>> > > > > > > > > the list of rlmMetadata to retain the remote size of a
>> > > > > > > > > topic-partition. Since the leader of a topic-partition is, in
>> > > > > > > > > non-degenerate cases, the only actor which can mutate the remote
>> > > > > > > > > part of the topic-partition, hence its size, it could in theory
>> > > > > > > > > only cache the size of the remote log once it has calculated it?
>> > > > > > > > > In which case there would not be any problem regarding the size
>> > > > > > > > > of the caching strategy. Did I miss something there?
>> > > > > > > > >
>> > > > > > > > > 102. There may be a few challenges to consider with caching:
>> > > > > > > > >
>> > > > > > > > > 102.1) As mentioned above, the caching strategy assumes no
>> > > > > > > > > mutation outside the lifetime of a leader. While this is true in
>> > > > > > > > > the normal course of operation, there could be accidental
>> > > > > > > > > mutation outside of the leader and a loss of consistency between
>> > > > > > > > > the cached state and the actual remote representation of the log.
>> > > > > > > > > E.g. split-brain scenarios, bugs in the plugins, bugs in external
>> > > > > > > > > systems with mutating access on the derived metadata. In the
>> > > > > > > > > worst case, a drift between the cached size and the actual size
>> > > > > > > > > could lead to over-deleting remote data, which is a durability
>> > > > > > > > > risk.
>> > > > > > > > >
>> > > > > > > > > The alternative you propose, by making the plugin the source of
>> > > > > > > > > truth w.r.t. the size of the remote log, can make it easier to
>> > > > > > > > > avoid inconsistencies between plugin-managed metadata and the
>> > > > > > > > > remote log from the perspective of Kafka. On the other hand,
>> > > > > > > > > plugin vendors would have to implement it with the expected
>> > > > > > > > > efficiency to have it yield benefits.
>> > > > > > > > >
>> > > > > > > > > 102.2) As you mentioned, the caching strategy in Kafka would
>> > > > > > > > > still require one iteration over the list of rlmMetadata when the
>> > > > > > > > > leadership of a topic-partition is assigned to a broker, while
>> > > > > > > > > the plugin can offer alternative constant-time approaches. This
>> > > > > > > > > calculation cannot be put on the LeaderAndIsr path and would be
>> > > > > > > > > performed in the background. In case of bulk leadership
>> > > > > > > > > migration, listing the rlmMetadata could a) result in request
>> > > > > > > > > bursts to any backend system the plugin may use [which shouldn’t
>> > > > > > > > > be a problem for high-throughput data stores but could have cost
>> > > > > > > > > implications] b) increase utilisation timespan of the RLM threads
>> > > > > > > > > for these calculations, potentially leading to transient
>> > > > > > > > > starvation of tasks queued for, typically, offloading operations
>> > > > > > > > > c) could have a non-marginal CPU footprint on hardware with
>> > > > > > > > > strict resource constraints. All these elements could have an
>> > > > > > > > > impact to some degree depending on the operational environment.
>> > > > > > > > >
>> > > > > > > > > From a design perspective, one question is where we want the
>> > > > > > > > > source of truth w.r.t. remote log size to be during the lifetime
>> > > > > > > > > of a leader. The responsibility of maintaining a consistent
>> > > > > > > > > representation of the remote log is shared by Kafka and the
>> > > > > > > > > plugin. Which system is best placed to maintain such a state
>> > > > > > > > > while providing the highest consistency guarantees is something
>> > > > > > > > > both Kafka and plugin designers could help understand better.
>> > > > > > > > >
>> > > > > > > > > Many thanks,
>> > > > > > > > > Alexandre
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > On Thu, Nov 17, 2022 at 19:27, Jun Rao <jun@confluent.io.invalid>
>> > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > Hi, Divij,
>> > > > > > > > > >
>> > > > > > > > > > Thanks for the reply.
>> > > > > > > > > >
>> > > > > > > > > > Point #1. Is the average remote segment metadata really 1KB?
>> > > > > > > > > > What's listed in the public interface is probably well below 100
>> > > > > > > > > > bytes.
>> > > > > > > > > >
>> > > > > > > > > > Point #2. I guess you are assuming that each broker only caches
>> > > > > > > > > > the remote segment metadata in memory. An alternative approach is
>> > > > > > > > > > to cache them in both memory and local disk. That way, on broker
>> > > > > > > > > > restart, you just need to fetch the new remote segments' metadata
>> > > > > > > > > > using the listRemoteLogSegments(TopicIdPartition
>> > > > > > > > > > topicIdPartition, int leaderEpoch) api. Will that work?
>> > > > > > > > > >
>> > > > > > > > > > Point #3. Thanks for the explanation and it sounds good.
>> > > > > > > > > >
>> > > > > > > > > > Thanks,
>> > > > > > > > > >
>> > > > > > > > > > Jun
>> > > > > > > > > >
>> > > > > > > > > > On Thu, Nov 17, 2022 at 7:31 AM Divij Vaidya <divijvaidya13@gmail.com>
>> > > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > Hi Jun
>> > > > > > > > > > >
>> > > > > > > > > > > There are three points that I would like to present here:
>> > > > > > > > > > >
>> > > > > > > > > > > 1. We would require a large cache size to efficiently cache all
>> > > > > > > > > > > segment metadata.
>> > > > > > > > > > > 2. Linear scan of all metadata at broker startup to populate the
>> > > > > > > > > > > cache will be slow and will impact the archival process.
>> > > > > > > > > > > 3. There is no other use case where a full scan of segment
>> > > > > > > > > > > metadata is required.
>> > > > > > > > > > >
>> > > > > > > > > > > Let's start by quantifying 1. Here's my estimate for the size of
>> > > > > > > > > > > the cache.
>> > > > > > > > > > > Average size of segment metadata = 1KB. This could be more if we
>> > > > > > > > > > > have frequent leader failover with a large number of leader
>> > > > > > > > > > > epochs being stored per segment.
>> > > > > > > > > > > Segment size = 100MB. Users will prefer to reduce the segment
>> > > > > > > > > > > size from the default value of 1GB to ensure timely archival of
>> > > > > > > > > > > data since data from the active segment is not archived.
>> > > > > > > > > > > Cache size = num segments * avg. segment metadata size =
>> > > > > > > > > > > (100TB/100MB)*1KB = 1GB.
>> > > > > > > > > > > While 1GB for cache may not sound like a large number for larger
>> > > > > > > > > > > machines, it does eat into the memory as an additional cache and
>> > > > > > > > > > > makes use cases with large data retention and low throughput
>> > > > > > > > > > > expensive (where such use cases could otherwise use smaller
>> > > > > > > > > > > machines).
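
The cache-size estimate above is plain arithmetic; a small Python check, using decimal units as an assumption (1KB = 10**3 bytes and so on), reproduces it. The second call plugs in the figures from Jun's earlier estimate quoted further down in this thread (1GB segments, 100 bytes of metadata per segment). The function name is made up for illustration.

```python
KB, MB, GB, TB = 10**3, 10**6, 10**9, 10**12  # decimal units, an assumption


def metadata_cache_estimate(data_per_broker: int, segment_size: int,
                            metadata_per_segment: int):
    """Return (number of segments, bytes of metadata to cache)."""
    num_segments = data_per_broker // segment_size
    return num_segments, num_segments * metadata_per_segment


# 100TB per broker, 100MB segments, 1KB of metadata per segment.
segments, cache_bytes = metadata_cache_estimate(100 * TB, 100 * MB, 1 * KB)

# 100TB per broker, 1GB segments, 100 bytes of metadata per segment.
jun_segments, jun_cache_bytes = metadata_cache_estimate(100 * TB, 1 * GB, 100)
```

The two estimates differ by a factor of 100: 10x from the smaller segment size and 10x from the larger per-segment metadata.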
>> > > > > > > > > > >
>> > > > > > > > > > > About point#2:
>> > > > > > > > > > > Even if we say that all segment metadata can fit into the cache,
>> > > > > > > > > > > we will need to populate the cache on broker startup. It would
>> > > > > > > > > > > not be in the critical path of broker startup and hence won't
>> > > > > > > > > > > impact the startup time. But it will impact the time when we
>> > > > > > > > > > > could start the archival process since the RLM thread pool will
>> > > > > > > > > > > be blocked on the first call to listRemoteLogSegments(). To scan
>> > > > > > > > > > > metadata for 1MM segments (computed above) and transfer 1GB data
>> > > > > > > > > > > over the network from a RLMM such as a remote database would be
>> > > > > > > > > > > in the order of minutes (depending on how efficient the scan is
>> > > > > > > > > > > with the RLMM implementation). Although I would concede that
>> > > > > > > > > > > having RLM threads blocked for a few minutes is perhaps OK, if we
>> > > > > > > > > > > introduce the new API proposed in the KIP, we would have a
>> > > > > > > > > > > deterministic startup time for RLM. Adding the API comes at a low
>> > > > > > > > > > > cost and I believe the trade-off is worth it.
>> > > > > > > > > > >
>> > > > > > > > > > > About point#3:
>> > > > > > > > > > > We can use listRemoteLogSegments(TopicIdPartition
>> > > > > > > > > > > topicIdPartition, int leaderEpoch) to calculate the segments
>> > > > > > > > > > > eligible for deletion (based on size retention) where leader
>> > > > > > > > > > > epoch(s) belong to the current leader epoch chain. I understand
>> > > > > > > > > > > that it may lead to segments belonging to other epoch lineage
>> > > > > > > > > > > not getting deleted and would require a separate mechanism to
>> > > > > > > > > > > delete them. The separate mechanism would anyway be required to
>> > > > > > > > > > > delete these "leaked" segments, as there are other cases which
>> > > > > > > > > > > could lead to leaks, such as network problems with RSM midway
>> > > > > > > > > > > through writing a segment, etc.
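
A minimal Python sketch of size-based retention over the current leader epoch chain, as described in point#3. All names and numbers are hypothetical. segments_by_epoch plays the role of listRemoteLogSegments(topicIdPartition, leaderEpoch), returning segments in ascending start-offset order, and the stopping rule (drop a segment only if the log stays at or above the retention size) is an assumption modelled on how local size-based retention behaves.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SegmentMetadata:
    start_offset: int
    size_in_bytes: int


def segments_eligible_for_deletion(segments_by_epoch: Dict[int, List[SegmentMetadata]],
                                   epoch_chain: List[int],
                                   retention_bytes: int) -> List[SegmentMetadata]:
    # Only epochs in the current chain are visited, so segments from other
    # lineages are never considered (they need a separate cleanup mechanism).
    chain_segments = [
        s for epoch in epoch_chain for s in segments_by_epoch.get(epoch, [])
    ]
    excess = sum(s.size_in_bytes for s in chain_segments) - retention_bytes
    eligible = []
    for segment in chain_segments:  # oldest first
        if excess - segment.size_in_bytes < 0:
            break  # deleting this one would drop the log below the limit
        eligible.append(segment)
        excess -= segment.size_in_bytes
    return eligible


leaked = SegmentMetadata(150, 100)  # written under epoch 1, not in the chain
store = {
    0: [SegmentMetadata(0, 100), SegmentMetadata(100, 100)],
    1: [leaked],
    2: [SegmentMetadata(200, 100)],
    3: [SegmentMetadata(300, 100)],
}
eligible = segments_eligible_for_deletion(store, epoch_chain=[0, 2, 3],
                                          retention_bytes=250)
```

With 400 bytes in the chain and a 250-byte limit, only the oldest segment is dropped; the epoch-1 segment stays untouched, which is the "leaked" case mentioned above.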
>> > > > > > > > > > >
>> > > > > > > > > > > Thank you for the replies so far. They have made me re-think my
>> > > > > > > > > > > assumptions and this dialogue has been very constructive for me.
>> > > > > > > > > > >
>> > > > > > > > > > > Regards,
>> > > > > > > > > > > Divij Vaidya
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > On Thu, Nov 10, 2022 at 10:49 PM Jun Rao <jun@confluent.io.invalid>
>> > > > > > > > > > > wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > Hi, Divij,
>> > > > > > > > > > > >
>> > > > > > > > > > > > Thanks for the reply.
>> > > > > > > > > > > >
>> > > > > > > > > > > > It's true that the data in Kafka could be kept longer with
>> > > > > > > > > > > > KIP-405. How much data do you envision to have per broker? For
>> > > > > > > > > > > > 100TB data per broker, with 1GB segment and segment metadata of
>> > > > > > > > > > > > 100 bytes, it requires 100TB/1GB*100 = 10MB, which should fit in
>> > > > > > > > > > > > memory.
>> > > > > > > > > > > >
>> > > > > > > > > > > > RemoteLogMetadataManager has two listRemoteLogSegments() methods.
>> > > > > > > > > > > > The one you listed, listRemoteLogSegments(TopicIdPartition
>> > > > > > > > > > > > topicIdPartition, int leaderEpoch), does return data in offset
>> > > > > > > > > > > > order. However, the other one,
>> > > > > > > > > > > > listRemoteLogSegments(TopicIdPartition topicIdPartition), doesn't
>> > > > > > > > > > > > specify the return order. I assume that you need the latter to
>> > > > > > > > > > > > calculate the segment size?
>> > > > > > > > > > > >
>> > > > > > > > > > > > Thanks,
>> > > > > > > > > > > >
>> > > > > > > > > > > > Jun
>> > > > > > > > > > > >
>> > > > > > > > > > > > On Thu, Nov 10, 2022 at 10:25 AM Divij Vaidya <divijvaidya13@gmail.com>
>> > > > > > > > > > > > wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > > *Jun,*
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > *"the default implementation of RLMM does local caching,
>> > > > > > > > > > > > > right?"*
>> > > > > > > > > > > > > Yes, Jun. The default implementation of RLMM does indeed cache
>> > > > > > > > > > > > > the segment metadata today; hence, it won't work for use cases
>> > > > > > > > > > > > > when the number of segments in remote storage is large enough to
>> > > > > > > > > > > > > exceed the size of the cache. As part of this KIP, I will
>> > > > > > > > > > > > > implement the newly proposed API in the default implementation
>> > > > > > > > > > > > > of RLMM, but the underlying implementation will still be a scan.
>> > > > > > > > > > > > > I will pick up optimizing that in a separate PR.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > *"we also cache all segment metadata in the brokers without
>> > > > > > > > > > > > > KIP-405. Do you see a need to change that?"*
>> > > > > > > > > > > > > Please correct me if I am wrong here, but we cache metadata for
>> > > > > > > > > > > > > segments "residing in local storage". The size of the current
>> > > > > > > > > > > > > cache works fine for the scale of the number of segments that we
>> > > > > > > > > > > > > expect to store in local storage. After KIP-405, that cache will
>> > > > > > > > > > > > > continue to store metadata for segments which are residing in
>> > > > > > > > > > > > > local storage and hence, we don't need to change that. For
>> > > > > > > > > > > > > segments which have been offloaded to remote storage, it would
>> > > > > > > > > > > > > rely on RLMM. Note that the scale of data stored in RLMM is
>> > > > > > > > > > > > > different from the local cache because the number of segments is
>> > > > > > > > > > > > > expected to be much larger than what the current implementation
>> > > > > > > > > > > > > stores in local storage.
>> > > > > > > storage.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > 2,3,4: RemoteLogMetadataManager.listRemoteLogSegments() does
>> > > > > > > > > > > > > specify the order, i.e. it returns the segments sorted by first
>> > > > > > > > > > > > > offset in ascending order. I am copying the API docs for KIP-405
>> > > > > > > > > > > > > here for your reference:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > *Returns iterator of remote log segment metadata, sorted by
>> > > > > > > > > > > > > {@link RemoteLogSegmentMetadata#startOffset()} in ascending order
>> > > > > > > > > > > > > which contains the given leader epoch. This is used by remote
>> > > > > > > > > > > > > log retention management subsystem to fetch the segment metadata
>> > > > > > > > > > > > > for a given leader epoch.*
>> > > > > > > > > > > > > *@param topicIdPartition topic partition*
>> > > > > > > > > > > > > *@param leaderEpoch leader epoch*
>> > > > > > > > > > > > > *@return Iterator of remote segments, sorted by start offset in
>> > > > > > > > > > > > > ascending order.*
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > *Luke,*
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > 5. Note that we are trying to optimize the efficiency of
>> > > > > > > > > > > > > size-based retention for remote storage. KIP-405 does not
>> > > > > > > > > > > > > introduce a new config, similar to
>> > > > > > > > > > > > > log.retention.check.interval.ms, for periodically checking
>> > > > > > > > > > > > > remote storage. Hence, the metric will be updated at the time of
>> > > > > > > > > > > > > invoking the log retention check for the remote tier, which is
>> > > > > > > > > > > > > pending implementation today. We can perhaps come back and
>> > > > > > > > > > > > > update the metric description after the implementation of the
>> > > > > > > > > > > > > log retention check in RemoteLogManager.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > --
>> > > > > > > > > > > > > Divij Vaidya
>> > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > On Thu, Nov 10, 2022 at 6:16 AM Luke Chen <showuon@gmail.com>
>> > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > > Hi Divij,
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > One more question about the metric:
>> > > > > > > > > > > > > > I think the metric will be updated
>> > > > > > > > > > > > > > (1) each time we run the log retention check (that is,
>> > > > > > > > > > > > > > log.retention.check.interval.ms)
>> > > > > > > > > > > > > > (2) when the user explicitly calls getRemoteLogSize
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Is that correct?
>> > > > > > > > > > > > > > Maybe we should add a note in the metric description;
>> > > > > > > > > > > > > > otherwise, a user who gets, let's say, 0 for RemoteLogSizeBytes
>> > > > > > > > > > > > > > will be surprised.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Otherwise, LGTM
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Thank you for the KIP
>> > > > > > > > > > > > > > Luke
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > On Thu, Nov 10, 2022 at 2:55 AM Jun Rao <jun@confluent.io.invalid>
>> > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Hi, Divij,
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Thanks for the explanation.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > 1. Hmm, the default implementation of RLMM does local caching,
>> > > > > > > > > > > > > > > right? Currently, we also cache all segment metadata in the
>> > > > > > > > > > > > > > > brokers without KIP-405. Do you see a need to change that?
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > 2,3,4: Yes, your explanation makes sense. However, currently,
>> > > > > > > > > > > > > > > RemoteLogMetadataManager.listRemoteLogSegments() doesn't
>> > > > > > > > > > > > > > > specify a particular order of the iterator. Do you intend to
>> > > > > > > > > > > > > > > change that?
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Thanks,
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Jun
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 3:31 AM Divij Vaidya <
>> > > > > > > > > > > divijvaidya13@gmail.com
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Hey Jun
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Thank you for your comments.
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > *1. "RLMM implementor could ensure that
>> > > > > > > > > listRemoteLogSegments()
>> > > > > > > > > > > is
>> > > > > > > > > > > > > > fast"*
>> > > > > > > > > > > > > > > > This would be ideal but pragmatically, it is
>> > > > > difficult
>> > > > > > to
>> > > > > > > > > ensure
>> > > > > > > > > > > > that
>> > > > > > > > > > > > > > > > listRemoteLogSegments() is fast. This is
>> > because
>> > > of
>> > > > > the
>> > > > > > > > > > > possibility
>> > > > > > > > > > > > > of
>> > > > > > > > > > > > > > a
>> > > > > > > > > > > > > > > > large number of segments (much larger than
>> what
>> > > > Kafka
>> > > > > > > > > currently
>> > > > > > > > > > > > > handles
>> > > > > > > > > > > > > > > > with local storage today) would make it
>> > > infeasible
>> > > > to
>> > > > > > > adopt
>> > > > > > > > > > > > > strategies
>> > > > > > > > > > > > > > > such
>> > > > > > > > > > > > > > > > as local caching to improve the performance
>> of
>> > > > > > > > > > > > listRemoteLogSegments.
>> > > > > > > > > > > > > > > Apart
>> > > > > > > > > > > > > > > > from caching (which won't work due to size
>> > > > > > limitations) I
>> > > > > > > > > can't
>> > > > > > > > > > > > think
>> > > > > > > > > > > > > > of
>> > > > > > > > > > > > > > > > other strategies which may eliminate the
>> need
>> > for
>> > > > IO
>> > > > > > > > > > > > > > > > operations proportional to the number of
>> total
>> > > > > > segments.
>> > > > > > > > > Please
>> > > > > > > > > > > > > advise
>> > > > > > > > > > > > > > if
>> > > > > > > > > > > > > > > > you have something in mind.
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > 2.  "*If the size exceeds the retention
>> size,
>> > we
>> > > > need
>> > > > > > to
>> > > > > > > > > > > determine
>> > > > > > > > > > > > > the
>> > > > > > > > > > > > > > > > subset of segments to delete to bring the
>> size
>> > > > within
>> > > > > > the
>> > > > > > > > > > > retention
>> > > > > > > > > > > > > > > limit.
>> > > > > > > > > > > > > > > > Do we need to call
>> > > > > > > > > > > RemoteLogMetadataManager.listRemoteLogSegments()
>> > > > > > > > > > > > > to
>> > > > > > > > > > > > > > > > determine that?"*
>> > > > > > > > > > > > > > > > Yes, we need to call
>> listRemoteLogSegments() to
>> > > > > > determine
>> > > > > > > > > which
>> > > > > > > > > > > > > > segments
>> > > > > > > > > > > > > > > > should be deleted. But there is a difference
>> > with
>> > > > the
>> > > > > > use
>> > > > > > > > > case we
>> > > > > > > > > > > > are
>> > > > > > > > > > > > > > > > trying to optimize with this KIP. To
>> determine
>> > > the
>> > > > > > subset
>> > > > > > > > of
>> > > > > > > > > > > > segments
>> > > > > > > > > > > > > > > which
>> > > > > > > > > > > > > > > > would be deleted, we only read metadata for
>> > > > segments
>> > > > > > > which
>> > > > > > > > > would
>> > > > > > > > > > > be
>> > > > > > > > > > > > > > > deleted
>> > > > > > > > > > > > > > > > via the listRemoteLogSegments(). But to
>> > determine
>> > > > the
>> > > > > > > > > > > totalLogSize,
>> > > > > > > > > > > > > > which
>> > > > > > > > > > > > > > > > is required every time retention logic
>> based on
>> > > > size
>> > > > > > > > > executes, we
>> > > > > > > > > > > > > read
>> > > > > > > > > > > > > > > > metadata of *all* the segments in remote
>> > storage.
>> > > > > > Hence,
>> > > > > > > > the
>> > > > > > > > > > > number
>> > > > > > > > > > > > > of
>> > > > > > > > > > > > > > > > results returned by
>> > > > > > > > > > > > *RemoteLogMetadataManager.listRemoteLogSegments()
>> > > > > > > > > > > > > > *is
>> > > > > > > > > > > > > > > > different when we are calculating
>> totalLogSize
>> > > vs.
>> > > > > when
>> > > > > > > we
>> > > > > > > > > are
>> > > > > > > > > > > > > > > determining
>> > > > > > > > > > > > > > > > the subset of segments to delete.
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > 3.
>> > > > > > > > > > > > > > > > *"Also, what about time-based retention? To
>> > make
>> > > > that
>> > > > > > > > > efficient,
>> > > > > > > > > > > do
>> > > > > > > > > > > > > we
>> > > > > > > > > > > > > > > need
>> > > > > > > > > > > > > > > > to make some additional interface
>> changes?"* No.
>> > > > Note
>> > > > > > that
>> > > > > > > > > time
>> > > > > > > > > > > > > > complexity
>> > > > > > > > > > > > > > > > to determine the segments for retention is
>> > > > different
>> > > > > > for
>> > > > > > > > time
>> > > > > > > > > > > based
>> > > > > > > > > > > > > vs.
>> > > > > > > > > > > > > > > > size based. For time based, the time
>> complexity
>> > > is
>> > > > a
>> > > > > > > > > function of
>> > > > > > > > > > > > the
>> > > > > > > > > > > > > > > number
>> > > > > > > > > > > > > > > > of segments which are "eligible for
>> deletion"
>> > > > (since
>> > > > > we
>> > > > > > > > only
>> > > > > > > > > read
>> > > > > > > > > > > > > > > metadata
>> > > > > > > > > > > > > > > > for segments which would be deleted)
>> whereas in
>> > > > size
>> > > > > > > based
>> > > > > > > > > > > > retention,
>> > > > > > > > > > > > > > the
>> > > > > > > > > > > > > > > > time complexity is a function of "all
>> segments"
>> > > > > > available
>> > > > > > > > in
>> > > > > > > > > > > remote
>> > > > > > > > > > > > > > > storage
>> > > > > > > > > > > > > > > > (metadata of all segments needs to be read
>> to
>> > > > > calculate
>> > > > > > > the
>> > > > > > > > > total
>> > > > > > > > > > > > > > size).
>> > > > > > > > > > > > > > > As
>> > > > > > > > > > > > > > > > you may observe, this KIP will bring the
>> time
>> > > > > > complexity
>> > > > > > > > for
>> > > > > > > > > both
>> > > > > > > > > > > > > time
>> > > > > > > > > > > > > > > > based retention & size based retention to
>> the
>> > > same
>> > > > > > > > function.
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > 4. Also, please note that this new API
>> > introduced
>> > > > in
>> > > > > > this
>> > > > > > > > KIP
>> > > > > > > > > > > also
>> > > > > > > > > > > > > > > enables
>> > > > > > > > > > > > > > > > us to provide a metric for total size of
>> data
>> > > > stored
>> > > > > in
>> > > > > > > > > remote
>> > > > > > > > > > > > > storage.
>> > > > > > > > > > > > > > > > Without the API, calculation of this metric
>> > will
>> > > > > become
>> > > > > > > > very
>> > > > > > > > > > > > > expensive
>> > > > > > > > > > > > > > > with
>> > > > > > > > > > > > > > > > *listRemoteLogSegments().*
>> > > > > > > > > > > > > > > > I understand that your motivation here is to
>> > > avoid
>> > > > > > > > polluting
>> > > > > > > > > the
>> > > > > > > > > > > > > > > interface
>> > > > > > > > > > > > > > > > with optimization specific APIs and I will
>> > agree
>> > > > with
>> > > > > > > that
>> > > > > > > > > goal.
>> > > > > > > > > > > > But
>> > > > > > > > > > > > > I
>> > > > > > > > > > > > > > > > believe that this new API proposed in the
>> KIP
>> > > > brings
>> > > > > in
>> > > > > > > > > > > significant
>> > > > > > > > > > > > > > > > improvement and there is no other
>> workaround
>> > > > > available
>> > > > > > > to
>> > > > > > > > > > > achieve
>> > > > > > > > > > > > > the
>> > > > > > > > > > > > > > > same
>> > > > > > > > > > > > > > > > performance.
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Regards,
>> > > > > > > > > > > > > > > > Divij Vaidya
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 12:12 AM Jun Rao
>> > > > > > > > > <jun@confluent.io.invalid
>> > > > > > > > > > > >
>> > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Hi, Divij,
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Thanks for the KIP. Sorry for the late
>> reply.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > The motivation of the KIP is to improve
>> the
>> > > > > > efficiency
>> > > > > > > of
>> > > > > > > > > size
>> > > > > > > > > > > > > based
>> > > > > > > > > > > > > > > > > retention. I am not sure the proposed
>> changes
>> > > are
>> > > > > > > enough.
>> > > > > > > > > For
>> > > > > > > > > > > > > > example,
>> > > > > > > > > > > > > > > if
>> > > > > > > > > > > > > > > > > the size exceeds the retention size, we
>> need
>> > to
>> > > > > > > determine
>> > > > > > > > > the
>> > > > > > > > > > > > > subset
>> > > > > > > > > > > > > > of
>> > > > > > > > > > > > > > > > > segments to delete to bring the size
>> within
>> > the
>> > > > > > > retention
>> > > > > > > > > > > limit.
>> > > > > > > > > > > > Do
>> > > > > > > > > > > > > > we
>> > > > > > > > > > > > > > > > need
>> > > > > > > > > > > > > > > > > to call
>> > > > > > > RemoteLogMetadataManager.listRemoteLogSegments()
>> > > > > > > > to
>> > > > > > > > > > > > > determine
>> > > > > > > > > > > > > > > > that?
>> > > > > > > > > > > > > > > > > Also, what about time-based retention? To
>> > make
>> > > > that
>> > > > > > > > > efficient,
>> > > > > > > > > > > do
>> > > > > > > > > > > > > we
>> > > > > > > > > > > > > > > need
>> > > > > > > > > > > > > > > > > to make some additional interface changes?
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > An alternative approach is for the RLMM
>> > > > implementor
>> > > > > > to
>> > > > > > > > make
>> > > > > > > > > > > sure
>> > > > > > > > > > > > > > > > > that
>> > > > > RemoteLogMetadataManager.listRemoteLogSegments()
>> > > > > > > is
>> > > > > > > > > fast
>> > > > > > > > > > > > > (e.g.,
>> > > > > > > > > > > > > > > with
>> > > > > > > > > > > > > > > > > local caching). This way, we could keep
>> the
>> > > > > interface
>> > > > > > > > > simple.
>> > > > > > > > > > > > Have
>> > > > > > > > > > > > > we
>> > > > > > > > > > > > > > > > > considered that?
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Thanks,
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Jun
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > On Wed, Sep 28, 2022 at 6:28 AM Divij
>> Vaidya
>> > <
>> > > > > > > > > > > > > > divijvaidya13@gmail.com>
>> > > > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > Hey folks
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > Does anyone else have any thoughts on
>> this
>> > > > > before I
>> > > > > > > > > propose
>> > > > > > > > > > > > this
>> > > > > > > > > > > > > > for
>> > > > > > > > > > > > > > > a
>> > > > > > > > > > > > > > > > > > vote?
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > --
>> > > > > > > > > > > > > > > > > > Divij Vaidya
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > On Mon, Sep 5, 2022 at 12:57 PM Satish
>> > > Duggana
>> > > > <
>> > > > > > > > > > > > > > > > satish.duggana@gmail.com
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > Thanks for the KIP Divij!
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > This is a nice improvement to avoid
>> > > > > recalculation
>> > > > > > > of
>> > > > > > > > > size.
>> > > > > > > > > > > > > > > Customized
>> > > > > > > > > > > > > > > > > > RLMMs
>> > > > > > > > > > > > > > > > > > > can implement the best possible
>> approach
>> > by
>> > > > > > caching
>> > > > > > > > or
>> > > > > > > > > > > > > > maintaining
>> > > > > > > > > > > > > > > > the
>> > > > > > > > > > > > > > > > > > size
>> > > > > > > > > > > > > > > > > > > in an efficient way. But this is not a
>> > big
>> > > > > > concern
>> > > > > > > > for
>> > > > > > > > > the
>> > > > > > > > > > > > > > default
>> > > > > > > > > > > > > > > > > topic
>> > > > > > > > > > > > > > > > > > > based RLMM as mentioned in the KIP.
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > ~Satish.
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > On Wed, 13 Jul 2022 at 18:48, Divij
>> > Vaidya
>> > > <
>> > > > > > > > > > > > > > > divijvaidya13@gmail.com>
>> > > > > > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > Thank you for your review Luke.
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > Reg: is that would the new
>> > > > > > `RemoteLogSizeBytes`
>> > > > > > > > > metric
>> > > > > > > > > > > > be a
>> > > > > > > > > > > > > > > > > > performance
>> > > > > > > > > > > > > > > > > > > > overhead? Although we move the
>> > > calculation
>> > > > > to a
>> > > > > > > > > separate
>> > > > > > > > > > > > API,
>> > > > > > > > > > > > > > we
>> > > > > > > > > > > > > > > > > still
>> > > > > > > > > > > > > > > > > > > > can't assume users will implement a
>> > > > > > light-weight
>> > > > > > > > > method,
>> > > > > > > > > > > > > right?
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > This metric would be logged using
>> the
>> > > > > > information
>> > > > > > > > > that is
>> > > > > > > > > > > > > > already
>> > > > > > > > > > > > > > > > > being
>> > > > > > > > > > > > > > > > > > > > calculated for handling remote
>> > retention
>> > > > > logic,
>> > > > > > > > > hence, no
>> > > > > > > > > > > > > > > > additional
>> > > > > > > > > > > > > > > > > > work
>> > > > > > > > > > > > > > > > > > > > is required to calculate this
>> metric.
>> > > More
>> > > > > > > > > specifically,
>> > > > > > > > > > > > > > whenever
>> > > > > > > > > > > > > > > > > > > > RemoteLogManager calls
>> getRemoteLogSize
>> > > > API,
>> > > > > > this
>> > > > > > > > > metric
>> > > > > > > > > > > > > would
>> > > > > > > > > > > > > > be
>> > > > > > > > > > > > > > > > > > > captured.
>> > > > > > > > > > > > > > > > > > > > This API call is made every time
>> > > > > > RemoteLogManager
>> > > > > > > > > wants
>> > > > > > > > > > > to
>> > > > > > > > > > > > > > handle
>> > > > > > > > > > > > > > > > > > expired
>> > > > > > > > > > > > > > > > > > > > remote log segments (which should be
>> > > > > periodic).
>> > > > > > > > Does
>> > > > > > > > > that
>> > > > > > > > > > > > > > address
>> > > > > > > > > > > > > > > > > your
>> > > > > > > > > > > > > > > > > > > > concern?
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > Divij Vaidya
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > On Tue, Jul 12, 2022 at 11:01 AM
>> Luke
>> > > Chen
>> > > > <
>> > > > > > > > > > > > > showuon@gmail.com>
>> > > > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > Hi Divij,
>> > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > Thanks for the KIP!
>> > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > I think it makes sense to delegate
>> > the
>> > > > > > > > > responsibility
>> > > > > > > > > > > of
>> > > > > > > > > > > > > > > > > calculation
>> > > > > > > > > > > > > > > > > > to
>> > > > > > > > > > > > > > > > > > > > the
>> > > > > > > > > > > > > > > > > > > > > specific RemoteLogMetadataManager
>> > > > > > > implementation.
>> > > > > > > > > > > > > > > > > > > > > But one thing I'm not quite sure,
>> is
>> > > that
>> > > > > > would
>> > > > > > > > > the new
>> > > > > > > > > > > > > > > > > > > > > `RemoteLogSizeBytes` metric be a
>> > > > > performance
>> > > > > > > > > overhead?
>> > > > > > > > > > > > > > > > > > > > > Although we move the calculation
>> to a
>> > > > > > separate
>> > > > > > > > > API, we
>> > > > > > > > > > > > > still
>> > > > > > > > > > > > > > > > can't
>> > > > > > > > > > > > > > > > > > > assume
>> > > > > > > > > > > > > > > > > > > > > users will implement a
>> light-weight
>> > > > method,
>> > > > > > > > right?
>> > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > Thank you.
>> > > > > > > > > > > > > > > > > > > > > Luke
>> > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > On Fri, Jul 1, 2022 at 5:47 PM
>> Divij
>> > > > > Vaidya <
>> > > > > > > > > > > > > > > > > divijvaidya13@gmail.com
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-852%3A+Optimize+calculation+of+size+for+log+in+remote+tier
>> > > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > > Hey folks
>> > > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > > Please take a look at this KIP
>> > which
>> > > > > > proposes
>> > > > > > > > an
>> > > > > > > > > > > > > extension
>> > > > > > > > > > > > > > to
>> > > > > > > > > > > > > > > > > > > KIP-405.
>> > > > > > > > > > > > > > > > > > > > > This
>> > > > > > > > > > > > > > > > > > > > > > is my first KIP with Apache
>> Kafka
>> > > > > community
>> > > > > > > so
>> > > > > > > > > any
>> > > > > > > > > > > > > feedback
>> > > > > > > > > > > > > > > > would
>> > > > > > > > > > > > > > > > > > be
>> > > > > > > > > > > > > > > > > > > > > highly
>> > > > > > > > > > > > > > > > > > > > > > appreciated.
>> > > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > > Cheers!
>> > > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > > --
>> > > > > > > > > > > > > > > > > > > > > > Divij Vaidya
>> > > > > > > > > > > > > > > > > > > > > > Sr. Software Engineer
>> > > > > > > > > > > > > > > > > > > > > > Amazon
>> > > > > > > > > > > > > > > > > > > > > >
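
For readers following the thread, the cost difference discussed above (reading metadata for *all* remote segments to get the total size, versus reading only the segments that will be deleted) can be illustrated with a small sketch. This is not Kafka code: `InMemoryMetadataStore`, `list_segments`, `size_by_full_scan`, `remote_log_size` and `segments_to_delete` are hypothetical names that only loosely mirror listRemoteLogSegments() and the size API proposed in the KIP. The sketch also assumes the listing returns segments oldest-first, which the thread notes is not currently specified.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class SegmentMetadata:
    """Hypothetical stand-in for one remote log segment's metadata."""
    base_offset: int
    size_bytes: int


class InMemoryMetadataStore:
    """Toy metadata store that counts how many segment records are read."""

    def __init__(self, segments: List[SegmentMetadata]) -> None:
        # Assumption: segments are listed oldest-first (by base offset).
        self._segments = sorted(segments, key=lambda s: s.base_offset)
        # Running total that a real store would maintain on add/delete.
        self._total_size = sum(s.size_bytes for s in self._segments)
        self.metadata_reads = 0

    def list_segments(self) -> Iterator[SegmentMetadata]:
        # Each yielded record models one metadata read.
        for segment in self._segments:
            self.metadata_reads += 1
            yield segment

    def size_by_full_scan(self) -> int:
        # Cost grows with the total number of segments in remote storage.
        return sum(s.size_bytes for s in self.list_segments())

    def remote_log_size(self) -> int:
        # Dedicated size lookup: no per-segment metadata reads.
        return self._total_size


def segments_to_delete(store: InMemoryMetadataStore,
                       total_size: int,
                       retention_bytes: int) -> List[SegmentMetadata]:
    """Select oldest segments until the log fits within retention_bytes."""
    excess = total_size - retention_bytes
    selected: List[SegmentMetadata] = []
    if excess <= 0:
        return selected
    for segment in store.list_segments():
        selected.append(segment)
        excess -= segment.size_bytes
        if excess <= 0:
            break  # stop reading once enough bytes are selected
    return selected
```

With the dedicated size lookup, a size-based retention pass only reads metadata for the segments it ends up deleting, which is the same cost profile described above for time-based retention.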