Posted to dev@spark.apache.org by Emre Sevinc <em...@gmail.com> on 2015/04/08 10:44:43 UTC

Which method do you think is better for making MIN_REMEMBER_DURATION configurable?

Hello,

This is about SPARK-3276: I want to make MIN_REMEMBER_DURATION (currently
a constant) configurable, with a default value. Before spending effort on
developing something and creating a pull request, I wanted to consult the
core developers to see which approach makes the most sense and has the
highest probability of being accepted.

The constant MIN_REMEMBER_DURATION can be seen at:


https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L338

It is marked as a private member of the private[streaming] object
FileInputDStream.
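
For reference, the definition currently looks roughly like this
(paraphrased from the linked source):

  package org.apache.spark.streaming.dstream

  import org.apache.spark.streaming.Minutes

  private[streaming] object FileInputDStream {
    // Minimum duration for which information about selected files is
    // remembered; files older than this window are ignored on (re)start.
    private val MIN_REMEMBER_DURATION = Minutes(1)
  }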

Approach 1: Make MIN_REMEMBER_DURATION a variable, renamed
minRememberDuration, and add a new fileStream method to
JavaStreamingContext.scala:


https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala

such that the new fileStream method accepts a new parameter, e.g.
minRememberDuration: Int (in seconds), and then uses this value to set
the private minRememberDuration.
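
If it helps the discussion, the new overload might look roughly like
this; the parameter name, its unit, and the elided body are only
illustrative, not a final API proposal:

  import org.apache.hadoop.mapreduce.{InputFormat => NewInputFormat}
  import org.apache.spark.streaming.api.java.JavaPairInputDStream

  // Hypothetical overload on JavaStreamingContext (Approach 1):
  def fileStream[K, V, F <: NewInputFormat[K, V]](
      directory: String,
      minRememberDuration: Int  // in seconds
    ): JavaPairInputDStream[K, V] = {
    ??? // body elided; it would thread minRememberDuration down to
        // FileInputDStream
  }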


Approach 2: Create a new, public Spark configuration property, e.g. named
spark.rememberDuration.min (with a default value of 60 seconds), and then
set the private variable minRememberDuration to the value of this Spark
property.
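
A minimal sketch of how the private value could then be initialized,
assuming the property name above (ssc being the StreamingContext
available inside FileInputDStream):

  import org.apache.spark.streaming.Seconds

  // Read the proposed property from the SparkConf, defaulting to
  // 60 seconds if the user sets nothing:
  private val minRememberDuration =
    Seconds(ssc.sparkContext.getConf.getLong("spark.rememberDuration.min", 60))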


Approach 1 would mean adding a new method to the public API; Approach 2
would mean creating a new public Spark property. Right now, Approach 2
seems simpler and more straightforward to me, but I nevertheless wanted
the opinions of developers who know the internals of Spark better than
I do.

Kind regards,
Emre Sevinç

Re: Which method do you think is better for making MIN_REMEMBER_DURATION configurable?

Posted by Jeremy Freeman <fr...@gmail.com>.
+1 for this feature

In our use case, we probably wouldn’t use this feature in production, but it can be useful during prototyping and algorithm development to repeatedly perform the same streaming operation on a fixed, already existing set of files.

-------------------------
jeremyfreeman.net
@thefreemanlab


Re: Which method do you think is better for making MIN_REMEMBER_DURATION configurable?

Posted by Emre Sevinc <em...@gmail.com>.
Tathagata,

Thanks for stating your preference for Approach 2.

My use case and motivation are similar to the concerns raised by others
in SPARK-3276. In previous versions of Spark, e.g. the 1.1.x series,
Spark Streaming applications could process the files that already existed
in an input directory before the streaming application started, and for
some projects we did for our customers we relied on that feature.
Starting with the 1.2.x series, we are limited in this respect to files
whose timestamps are no older than 1 minute. The only workaround is to
'touch' those files before starting the streaming application.

Moreover, MIN_REMEMBER_DURATION is set to the arbitrary value of 1
minute, and I don't see any reason why it could not be set to another
value (keeping the default of 1 minute if nothing is set by the user).

Putting all this together, my plan is to create a pull request along
these lines:

  1- Convert "private val MIN_REMEMBER_DURATION" into "private val
minRememberDuration" (to reflect that it is no longer a constant, since
it can now be set via configuration)

  2- Set its value using something like
getConf("spark.streaming.minRememberDuration", Minutes(1)) (see the
sketch after this list)

  3- Document spark.streaming.minRememberDuration in the Spark Streaming
Programming Guide
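
The getConf(...) call in step 2 is pseudocode, since SparkConf has no
Duration-typed getter; one concrete possibility, interpreting the value
as seconds, would be:

  import org.apache.spark.streaming.Seconds

  // Steps 1 and 2 as a sketch: the renamed val, initialized from the
  // proposed property with a 1-minute (60 s) default. The exact
  // accessor and units are open to review.
  private val minRememberDuration = Seconds(
    ssc.sparkContext.getConf.getLong("spark.streaming.minRememberDuration", 60))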

If the above sounds fine, then I'll go ahead and implement this small
change and submit a pull request to fix SPARK-3276.

What do you say?

Kind regards,

Emre Sevinç
http://www.bigindustries.be/


-- 
Emre Sevinc

Re: Which method do you think is better for making MIN_REMEMBER_DURATION configurable?

Posted by Tathagata Das <td...@databricks.com>.
Approach 2 is definitely better :)
Can you tell us more about the use case and why you want to do this?

TD
