You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/08/18 17:08:21 UTC

[jira] [Commented] (FLINK-3660) Measure latency of elements and expose it through web interface

    [ https://issues.apache.org/jira/browse/FLINK-3660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426825#comment-15426825 ] 

ASF GitHub Bot commented on FLINK-3660:
---------------------------------------

GitHub user rmetzger opened a pull request:

    https://github.com/apache/flink/pull/2386

    [FLINK-3660] Measure latency and exposes them via a metric

    This commit adds the initial runtime support for measuring latency of records going through the system.
    
    I therefore introduced a new StreamElement, called a LatencyMarker.
    Similar to Watermarks, LatencyMarkers are emitted from the sources at an configured interval. The default value for the interval is 2000 ms. The emission of markers can be disabled by setting the interval to 0. LatencyMarkers can not "overtake" regular elements. This ensures that the measured latency approximates the end-to-end latency of regular stream elements.
    
    Regular operators (excluding those participating in iterations) forward latency markers if they are not a sink.
    Operators with many outputs randomly select one to forward the maker to. This ensures that every marker exists only once in the system, and that repartition steps are not causing an explosion in the number of transferred markers.
    If an operator is a sink, it will maintain the last 512 latencies from each known source instance.
    The min/max/mean/p50/p95/p99 of each known source is reported using a special LatencyGauge from the sink (every operator can be a sink, if it doesn't have any outputs).
    
    This commit does not visualize the latency in the web interface.
    Also, there is currently no mechanism to ensure that the system clocks are in-sync, so the latency measurements will be inaccurate if the hardware clocks are not correct.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rmetzger/flink flink3660-pr

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2386.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2386
    
----

----


> Measure latency of elements and expose it through web interface
> ---------------------------------------------------------------
>
>                 Key: FLINK-3660
>                 URL: https://issues.apache.org/jira/browse/FLINK-3660
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Streaming
>            Reporter: Robert Metzger
>            Assignee: Robert Metzger
>             Fix For: pre-apache
>
>
> It would be nice to expose the end-to-end latency of a streaming job in the webinterface.
> To achieve this, my initial thought was to attach an ingestion-time timestamp at the sources to each record.
> However, this introduces overhead for a monitoring feature users might not even use (8 bytes for each element + System.currentTimeMilis() on each element).
> Therefore, I suggest to implement this feature by periodically sending special events, similar to watermarks through the topology. 
> Those {{LatencyMarks}} are emitted at a configurable interval at the sources and forwarded by the tasks. The sinks will compare the timestamp of the latency marks with their current system time to determine the latency.
> The latency marks will not add to the latency of a job, but the marks will be delayed similarly than regular records, so their latency will approximate the record latency.
> Above suggestion expects the clocks on all taskmanagers to be in sync. Otherwise, the measured latencies would also include the offsets between the taskmanager's clocks.
> In a second step, we can try to mitigate the issue by using the JobManager as a central timing service. The TaskManagers will periodically query the JM for the current time in order to determine the offset with their clock.
> This offset would still include the network latency between TM and JM but it would still lead to reasonably good estimations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)