Posted to commits@samza.apache.org by "Cameron Lee (JIRA)" <ji...@apache.org> on 2018/04/20 02:29:00 UTC

[jira] [Created] (SAMZA-1670) When fetching a newest offset for a partition, also prefetch and cache the newest offsets for other partitions on the container

Cameron Lee created SAMZA-1670:
----------------------------------

Summary: When fetching a newest offset for a partition, also prefetch and cache the newest offsets for other partitions on the container
Key: SAMZA-1670
URL: https://issues.apache.org/jira/browse/SAMZA-1670
Project: Samza
Issue Type: Improvement
Reporter: Cameron Lee

ExtendedSystemAdmin.getNewestOffset currently works on only one system-stream-partition at a time. As an optimization, when one system-stream-partition needs a newest offset, a batch call can be used to also fetch (and cache) the newest offsets for the other partitions on the same container.

This can reduce the volume of calls made to system admins for newest offset metadata. It can also reduce contention on system admins when multiple threads need metadata at the same time.

*Proposed approach:*

Add a new getNewestOffset API to StreamMetadataCache. Have the cache keep track of every system-stream-partition that has previously asked for a newest offset. When a system-stream-partition needs newest offset metadata, check whether any other tracked entries are stale and fetch those in the same batch call. This also requires adding a getNewestOffsets batch call to ExtendedSystemAdmin. The benefit is that StreamMetadataCache is already shared by multiple tasks; the disadvantage is that it has to keep track of new state.
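
A minimal sketch of the tracking idea, for illustration only. Everything here is hypothetical: NewestOffsetCache, the String partition keys, the time-based staleness rule, and the batchFetcher function (which stands in for the proposed getNewestOffsets batch call) are assumptions, not Samza's actual StreamMetadataCache or ExtendedSystemAdmin API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical cache; names and types are illustrative, not Samza's API. */
class NewestOffsetCache {
  private static final class Entry {
    final String offset;
    final long fetchedAtMs;

    Entry(String offset, long fetchedAtMs) {
      this.offset = offset;
      this.fetchedAtMs = fetchedAtMs;
    }
  }

  // Stand-in for a batch getNewestOffsets call on the system admin (assumption).
  private final Function<Set<String>, Map<String, String>> batchFetcher;
  private final long ttlMs;          // assumed staleness rule: entries expire after ttlMs
  private final LongSupplier clock;  // injectable clock, to keep the sketch testable
  private final Set<String> knownPartitions = new HashSet<>();
  private final Map<String, Entry> cache = new HashMap<>();

  NewestOffsetCache(Function<Set<String>, Map<String, String>> batchFetcher,
                    long ttlMs, LongSupplier clock) {
    this.batchFetcher = batchFetcher;
    this.ttlMs = ttlMs;
    this.clock = clock;
  }

  private boolean isStale(Entry entry, long nowMs) {
    return entry == null || nowMs - entry.fetchedAtMs >= ttlMs;
  }

  /** Returns the newest offset for one partition, refreshing all stale known partitions in one batch. */
  synchronized String getNewestOffset(String partition) {
    knownPartitions.add(partition); // this is the "new state" the cache has to build up
    long nowMs = clock.getAsLong();
    Entry current = cache.get(partition);
    if (!isStale(current, nowMs)) {
      return current.offset;
    }
    // Collect every tracked partition whose entry is missing or stale.
    Set<String> stale = new HashSet<>();
    for (String known : knownPartitions) {
      if (isStale(cache.get(known), nowMs)) {
        stale.add(known);
      }
    }
    // One batch call covers the requested partition and the other stale ones.
    for (Map.Entry<String, String> fetched : batchFetcher.apply(stale).entrySet()) {
      cache.put(fetched.getKey(), new Entry(fetched.getValue(), nowMs));
    }
    Entry refreshed = cache.get(partition);
    return refreshed == null ? null : refreshed.offset;
  }
}
```

In this sketch the first request for each partition still triggers its own call, because the cache only learns about partitions as they ask. The batching benefit appears on later refreshes, once several partitions are tracked.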

*Alternative approach:*

Collect all system-stream-partitions that will need newest offset metadata at setup, and then make the batch call whenever any of those partitions needs metadata and cache the metadata. The benefit for this approach is that no state needs to be built up, as it is known at setup, but it might be unclean to do the initial collection and keep track of it. For example, it might be necessary to store container-granular information inside partition-granular objects (e.g. TaskStorageManager).
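
For comparison, a minimal sketch of the alternative, again under stated assumptions: ContainerNewestOffsets, the String partition keys, the invalidate method, and the batchFetcher function (standing in for a batch getNewestOffsets call) are all hypothetical and not part of Samza.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical holder for container-wide newest offsets; not Samza's API. */
class ContainerNewestOffsets {
  // The full partition set is fixed at setup, so no state is built up later.
  private final Set<String> containerPartitions;
  // Stand-in for a batch getNewestOffsets call on the system admin (assumption).
  private final Function<Set<String>, Map<String, String>> batchFetcher;
  private Map<String, String> offsets; // null until the first request, or after invalidate()

  ContainerNewestOffsets(Set<String> containerPartitions,
                         Function<Set<String>, Map<String, String>> batchFetcher) {
    this.containerPartitions = Collections.unmodifiableSet(new HashSet<>(containerPartitions));
    this.batchFetcher = batchFetcher;
  }

  /** The first request from any partition fetches offsets for every partition on the container. */
  synchronized String getNewestOffset(String partition) {
    if (!containerPartitions.contains(partition)) {
      throw new IllegalArgumentException("Partition was not registered at setup: " + partition);
    }
    if (offsets == null) {
      offsets = new HashMap<>(batchFetcher.apply(containerPartitions));
    }
    return offsets.get(partition);
  }

  /** Drops the cached offsets so the next request triggers a fresh batch call (assumed policy). */
  synchronized void invalidate() {
    offsets = null;
  }
}
```

Here a single instance would have to be shared by all partition-granular objects on the container, which is the coupling concern raised above.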

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)