Posted to issues@spark.apache.org by "Paulo Cândido (JIRA)" <ji...@apache.org> on 2017/01/08 14:02:58 UTC

[jira] [Created] (SPARK-19125) Streaming Duration by Count

Paulo Cândido created SPARK-19125:
-------------------------------------

             Summary: Streaming Duration by Count
                 Key: SPARK-19125
                 URL: https://issues.apache.org/jira/browse/SPARK-19125
             Project: Spark
          Issue Type: Improvement
          Components: DStreams
         Environment: Java
            Reporter: Paulo Cândido


I use Spark Streaming for scientific work. In this setting we have to run the same experiment many times with the same seed and obtain the same result. All random components take the seed as input, so I can control them. However, there is one component that does not depend on any seed and cannot be controlled: the batch size. Regardless of how the stream is ingested, the metric used to cut micro-batches is wall-clock time. This is a problem in a scientific environment because running the same experiment with the same parameters can yield a different result each time, depending on how many elements are read into each batch. The same stream source may produce different batch sizes across executions purely because of wall-clock timing.

My suggestion is to provide a new Duration metric: a count of elements.

Regardless of the time spent filling a micro-batch, batches would always be the same size, and when the source uses a seed to generate the same values, we could replicate experiments with identical results independently of throughput.
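To illustrate the idea, here is a minimal, self-contained Java sketch of count-based batching: a batch is emitted after a fixed number of elements rather than after a wall-clock interval. The class name CountBatcher and its methods are hypothetical and not part of any Spark API; this only demonstrates why count-based cuts are deterministic for a given input sequence.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: cut a micro-batch every batchSize elements instead of
// every N milliseconds. For the same input sequence, the resulting batches
// are identical on every run, regardless of arrival timing.
public class CountBatcher {
    private final int batchSize;
    private final List<Integer> buffer = new ArrayList<>();
    private final List<List<Integer>> batches = new ArrayList<>();

    public CountBatcher(int batchSize) {
        this.batchSize = batchSize;
    }

    // Receive one element; emit a completed batch once batchSize elements
    // have accumulated.
    public void onElement(int element) {
        buffer.add(element);
        if (buffer.size() == batchSize) {
            batches.add(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    public List<List<Integer>> batches() {
        return batches;
    }

    public static void main(String[] args) {
        CountBatcher batcher = new CountBatcher(3);
        for (int i = 1; i <= 9; i++) {
            batcher.onElement(i);
        }
        // Deterministic partitioning of the input into fixed-size batches.
        System.out.println(batcher.batches()); // [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    }
}
```

With a seeded source feeding onElement, every execution partitions the stream identically, which is exactly the reproducibility property wall-clock batching cannot guarantee.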



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org