Posted to issues@spark.apache.org by "Paulo Cândido (JIRA)" <ji...@apache.org> on 2017/01/08 14:02:58 UTC

[jira] [Created] (SPARK-19125) Streaming Duration by Count

Paulo Cândido created SPARK-19125:
-------------------------------------

             Summary: Streaming Duration by Count
                 Key: SPARK-19125
                 URL: https://issues.apache.org/jira/browse/SPARK-19125
             Project: Spark
          Issue Type: Improvement
          Components: DStreams
         Environment: Java
            Reporter: Paulo Cândido


I use Spark Streaming for scientific work. In this setting we have to run the same experiment many times with the same seed and obtain the same result. All random components take the seed as input, so I can control them. However, there is one component that does not depend on any seed and cannot be controlled: the batch size. Regardless of how the stream is ingested, the metric used to cut micro-batches is wall-clock time. This is a problem in a scientific environment because running the same experiment with the same parameters can yield a different result each time, depending on how many elements are read into each batch. The same stream source may produce different batch sizes across executions purely because of wall-clock timing.

My suggestion is to provide a new Duration metric: a count of elements.

Regardless of the time spent filling a micro-batch, batches would always be the same size, and when the source uses a seed to generate the same values, we could replicate experiments with identical results independently of throughput.
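To illustrate the idea, here is a minimal, self-contained Java sketch of count-based batching: a batch is emitted after a fixed number of elements rather than after a wall-clock interval. The class name CountBatcher and its methods are hypothetical and not part of any Spark API; this only demonstrates why count-based cuts are deterministic for a given input sequence.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: cut a micro-batch every batchSize elements instead of
// every N milliseconds. For the same input sequence, the resulting batches
// are identical on every run, regardless of arrival timing.
public class CountBatcher {
    private final int batchSize;
    private final List<Integer> buffer = new ArrayList<>();
    private final List<List<Integer>> batches = new ArrayList<>();

    public CountBatcher(int batchSize) {
        this.batchSize = batchSize;
    }

    // Receive one element; emit a completed batch once batchSize elements
    // have accumulated.
    public void onElement(int element) {
        buffer.add(element);
        if (buffer.size() == batchSize) {
            batches.add(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    public List<List<Integer>> batches() {
        return batches;
    }

    public static void main(String[] args) {
        CountBatcher batcher = new CountBatcher(3);
        for (int i = 1; i <= 9; i++) {
            batcher.onElement(i);
        }
        // Deterministic partitioning of the input into fixed-size batches.
        System.out.println(batcher.batches()); // [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    }
}
```

With a seeded source feeding onElement, every execution partitions the stream identically, which is exactly the reproducibility property wall-clock batching cannot guarantee.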



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org