Posted to commits@samza.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/05/12 01:54:00 UTC

[jira] [Commented] (SAMZA-1670) When fetching a newest offset for a partition, also prefetch and cache the newest offsets for other partitions on the container

    [ https://issues.apache.org/jira/browse/SAMZA-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472871#comment-16472871 ] 

ASF GitHub Bot commented on SAMZA-1670:
---------------------------------------

GitHub user cameronlee314 opened a pull request:

    https://github.com/apache/samza/pull/520

    SAMZA-1670 : When fetching a newest offset for a partition, also prefetch and cache the newest offsets for other partitions on the container

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cameronlee314/samza partition_metadata

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/samza/pull/520.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #520
    
----
commit 20c483192ba0a378f02ea38afa9729a81ee74b33
Author: Cameron Lee <ca...@...>
Date:   2018-05-11T22:18:06Z

    SAMZA-1670 : When fetching a newest offset for a partition, also prefetch and cache the newest offsets for other partitions on the container

----


> When fetching a newest offset for a partition, also prefetch and cache the newest offsets for other partitions on the container
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SAMZA-1670
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1670
>             Project: Samza
>          Issue Type: Improvement
>            Reporter: Cameron Lee
>            Priority: Major
>              Labels: metadata
>
> ExtendedSystemAdmin.getNewestOffset currently works on only one system-stream-partition at a time. As an optimization, when one system-stream-partition needs a newest offset, a batch call can be used to also fetch (and cache) the newest offsets for the other partitions on the same container.
> This can help to reduce the call volume to system admins to get newest offset metadata. This can also help reduce contention on system admins when metadata is needed by multiple threads at the same time.
> *Proposed approach:*
> Add a new getNewestOffset API to StreamMetadataCache. Have the cache keep track of all system-stream-partitions that have asked for newest offsets before, and when a system-stream-partition needs newest offset metadata, check if there are any other stale entries and fetch those as well. This also requires adding a getNewestOffsets batch call to ExtendedSystemAdmin. The benefit here is that StreamMetadataCache is already reused by multiple tasks, but the disadvantage is that it has to keep track of new state.
> *Alternative approach:*
> Collect all system-stream-partitions that will need newest offset metadata at setup, then make the batch call whenever any of those partitions needs metadata, and cache the result. The benefit of this approach is that no state needs to be built up, since the set of partitions is known at setup, but doing the initial collection and keeping track of it might be unclean. For example, it might be necessary to store container-granular information inside partition-granular objects (e.g. TaskStorageManager).
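
For illustration only, the proposed approach could be sketched roughly as below. This is not the code from the pull request and not the real StreamMetadataCache: the class name NewestOffsetCache, the TTL-based staleness rule, the explicit nowMs parameter, and the use of plain strings in place of SystemStreamPartition are all assumptions made to keep the sketch self-contained. The batchFetcher function stands in for the proposed getNewestOffsets batch call on ExtendedSystemAdmin.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of the "proposed approach"; names and staleness rule are assumptions.
class NewestOffsetCache {
    // One cached newest offset plus the time it was fetched.
    private static final class CachedOffset {
        final String offset;
        final long fetchedAtMs;

        CachedOffset(String offset, long fetchedAtMs) {
            this.offset = offset;
            this.fetchedAtMs = fetchedAtMs;
        }
    }

    // Stand-in for a batch getNewestOffsets call (assumed shape: partitions -> offsets).
    private final Function<Set<String>, Map<String, String>> batchFetcher;
    private final long ttlMs;
    // Every partition that has ever asked for a newest offset (the "new state" the issue mentions).
    private final Set<String> knownPartitions = new HashSet<>();
    private final Map<String, CachedOffset> cache = new HashMap<>();

    NewestOffsetCache(Function<Set<String>, Map<String, String>> batchFetcher, long ttlMs) {
        this.batchFetcher = batchFetcher;
        this.ttlMs = ttlMs;
    }

    private boolean isStale(CachedOffset entry, long nowMs) {
        return entry == null || nowMs - entry.fetchedAtMs >= ttlMs;
    }

    // Returns the newest offset for the partition. On a miss, all other known
    // partitions with stale or missing entries are refreshed in the same batch call.
    synchronized String getNewestOffset(String partition, long nowMs) {
        knownPartitions.add(partition);
        CachedOffset current = cache.get(partition);
        if (!isStale(current, nowMs)) {
            return current.offset;
        }

        // Collect the requested partition plus every other stale known partition.
        Set<String> toFetch = new HashSet<>();
        for (String known : knownPartitions) {
            if (isStale(cache.get(known), nowMs)) {
                toFetch.add(known);
            }
        }

        Map<String, String> fetched = batchFetcher.apply(toFetch);
        for (Map.Entry<String, String> result : fetched.entrySet()) {
            cache.put(result.getKey(), new CachedOffset(result.getValue(), nowMs));
        }

        CachedOffset refreshed = cache.get(partition);
        return refreshed == null ? null : refreshed.offset;
    }
}
```

In this sketch a single synchronized method serializes callers, so concurrent tasks asking for different partitions would share one batch call instead of each contacting the system admin. Whether the actual change uses a TTL, a lock, or some other invalidation scheme is not stated in the issue.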



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)