Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2017/01/08 14:10:58 UTC

[jira] [Commented] (SPARK-19125) Streaming Duration by Count

    [ https://issues.apache.org/jira/browse/SPARK-19125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15809449#comment-15809449 ] 

Sean Owen commented on SPARK-19125:
-----------------------------------

That's so at odds with the architecture that I don't think that would ever be implemented. What you have isn't really a streaming problem anymore if you don't care about time so much as waiting for n events.

> Streaming Duration by Count
> ---------------------------
>
>                 Key: SPARK-19125
>                 URL: https://issues.apache.org/jira/browse/SPARK-19125
>             Project: Spark
>          Issue Type: Improvement
>          Components: DStreams
>         Environment: Java
>            Reporter: Paulo Cândido
>
> I use Spark Streaming for scientific experiments. In this setting, we have to run the same experiment many times with the same seed to obtain the same result. All random components take the seed as input, so I can control them. However, there is one component that does not depend on the seed and that we cannot control: the batch size. Regardless of how the stream is ingested, micro-batches are delimited by wall-clock time. This is a problem in a scientific environment because, if we run the same experiment with the same parameters many times, we can get a different result each time, depending on the number of elements read in each batch. The same stream source may produce different batch sizes across executions because of wall-clock time.
> My suggestion is to provide a new Duration metric: count of elements.
> Regardless of the time spent filling a micro-batch, batches will always be the same size, and when the source uses a seed to generate the same values, we will be able to replicate the experiments with the same result, independent of throughput.
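
To make the request concrete, here is a minimal sketch of count-based batching in plain Java. It does not use Spark and is not how DStreams work; the class name CountBatcher and the method batches are hypothetical, chosen only for illustration. It assumes a source driven by a seeded java.util.Random and shows that cutting batches by element count gives the same batch boundaries and contents on every run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Spark: groups a seeded sequence of values
// into batches by element count rather than by wall-clock time.
class CountBatcher {

    // Generates `total` pseudo-random ints from `seed` and splits them into
    // consecutive batches of `batchSize` elements (the last one may be shorter).
    static List<List<Integer>> batches(long seed, int total, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        Random source = new Random(seed);
        List<List<Integer>> result = new ArrayList<>();
        List<Integer> current = new ArrayList<>(batchSize);
        for (int i = 0; i < total; i++) {
            current.add(source.nextInt(100));
            // Close the batch once it holds exactly batchSize elements.
            if (current.size() == batchSize) {
                result.add(current);
                current = new ArrayList<>(batchSize);
            }
        }
        // Keep any trailing partial batch.
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        // Two runs with the same seed yield identical batches; prints "true".
        System.out.println(batches(42L, 10, 4).equals(batches(42L, 10, 4)));
    }
}
```

Because the cut depends only on the element count, throughput and timing have no effect on which elements land in which batch. That determinism is what the proposal asks for, and it comes at the cost of no longer bounding how long a batch takes to fill.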


