Posted to user@spark.apache.org by Adrian Mocanu <am...@verticalscope.com> on 2014/02/26 18:34:52 UTC

window every n elements instead of time based

Hi
Is there a way to do window processing based not on time but on item count, e.g. every 6 items going through the stream?

Example:
Window of size 3 with 1 item "duration"
Stream data: 1,2,3,4,5,6,7
[1,2,3]=window 1
[2,3,4]=window 2
[3,4,5]=window 3
etc.

-Adrian
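
For reference, the semantics being asked for match what Scala's own sliding iterator produces over an ordinary collection; a minimal sketch reproducing the example output above:

    // Count-based sliding windows over a plain collection, reproducing the
    // example above: window size 3, sliding by 1 element.
    val data = List(1, 2, 3, 4, 5, 6, 7)
    data.sliding(3, 1).foreach(w => println(w.mkString("[", ",", "]")))
    // [1,2,3]
    // [2,3,4]
    // [3,4,5]
    // [4,5,6]
    // [5,6,7]

The question is how to get the same behavior over a DStream rather than a collection.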


Re: window every n elements instead of time based

Posted by Michael Allman <mi...@videoamp.com>.
Yes, I meant batch interval. Thanks for clarifying.

Cheers,

Michael



Re: window every n elements instead of time based

Posted by Jayant Shekhar <ja...@cloudera.com>.
Hi Michael,

I think you mean the batch interval rather than windowing. Increasing it can
be helpful when you do not want to process very small batches.
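
To make that concrete, the batch interval is fixed once, when the StreamingContext is created; a minimal sketch (the application name and the 10-second interval are illustrative):

    // The batch interval is set when the StreamingContext is created and
    // applies to the whole application.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("example")
    val ssc  = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches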

The HDFS sink in Flume has the concept of rolling files based on time, number
of events, or size.
https://flume.apache.org/FlumeUserGuide.html#hdfs-sink
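
For reference, the relevant rolling knobs on the Flume HDFS sink look like this (the agent and sink names are made up; see the guide above for defaults):

    # Roll a new file every 300 seconds or after ~128 MB, whichever comes
    # first; a value of 0 disables that trigger.
    agent.sinks.k1.type = hdfs
    agent.sinks.k1.hdfs.path = hdfs://namenode/flume/events
    agent.sinks.k1.hdfs.rollInterval = 300
    agent.sinks.k1.hdfs.rollSize = 134217728
    agent.sinks.k1.hdfs.rollCount = 0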

The same could be applied to Spark if the use cases demand it. The only major
catch is that it would break the concept of window operations as they
currently exist in Spark.

Thanks,
Jayant





Re: window every n elements instead of time based

Posted by Michael Allman <mi...@videoamp.com>.
Hi Andrew,

The use case I have in mind is batch data serialization to HDFS, where sizing files to a certain HDFS block size is desired. In my particular use case, I want to process 10GB batches of data at a time. I'm not sure this is a sensible use case for Spark Streaming, and I was trying to test it. However, I had trouble getting it working, and in the end I decided it was more trouble than it was worth.

So I split my task in two: a streaming job on small, time-defined batches of data, and a traditional Spark job aggregating the smaller files into a larger whole (sketched below). In retrospect, I think this is the right way to go even if a count-based window specification were possible. Therefore, I can't offer my use case as motivation for a count-based window size.

Cheers,

Michael
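
A minimal sketch of the aggregation half of that split; the paths, application name, and partition count are illustrative, not from the original post:

    // The batch half of the split: read the many small files the streaming
    // job wrote and rewrite them as a few large partitions sized near an
    // HDFS block.
    import org.apache.spark.{SparkConf, SparkContext}

    object Compact {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("compact-small-files"))
        sc.textFile("hdfs:///events/small/*")
          .coalesce(8)                       // 8 large output partitions
          .saveAsTextFile("hdfs:///events/compacted")
        sc.stop()
      }
    }

coalesce avoids a full shuffle when reducing the partition count; repartition could be used instead if the data needs rebalancing across the larger files.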



Re: window every n elements instead of time based

Posted by Andrew Ash <an...@andrewash.com>.
Hi Michael,

I couldn't find anything in Jira for it --
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22window%22%20AND%20component%20%3D%20Streaming

Could you or Adrian please file a Jira ticket explaining the functionality
and maybe proposing an API? This will help people interested in count-based
windowing understand the state of the feature in Spark Streaming.

Thanks!
Andrew


Re: window every n elements instead of time based

Posted by Michael Allman <mi...@videoamp.com>.
Hi,

I also have a use for count-based windowing. I'd like to process data
batches by size as opposed to time. Is this feature on the development
roadmap? Is there a JIRA ticket for it?

Thank you,

Michael





Re: window every n elements instead of time based

Posted by Tathagata Das <ta...@gmail.com>.
Well, it has been +1'd in our heads. ;) We will keep this in mind.

TD



RE: window every n elements instead of time based

Posted by Adrian Mocanu <am...@verticalscope.com>.
If there is somewhere I can +1 this feature, let me know.

My use case is financial indicators (math formulas), and a lot of them go by window count, like moving averages.

Thanks
A
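
For concreteness, a simple moving average is exactly a count-based window computation; a minimal sketch over a plain sequence (the prices are made up):

    // Average each window of 3 consecutive prices, sliding by 1 element.
    val prices = Seq(10.0, 11.0, 12.5, 12.0, 11.5)
    val sma = prices.sliding(3).map(w => w.sum / w.size).toList
    // List(11.1666..., 11.8333..., 12.0)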


Re: window every n elements instead of time based

Posted by Tathagata Das <ta...@gmail.com>.
Currently, all built-in DStream window operations are time-based. We may
provide count-based windowing in the future.
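
In the meantime, a possible driver-side workaround for low-volume streams is sketched below; it buffers elements across micro-batches and emits windows by count. The function and all names here are illustrative, not a Spark API:

    // collect() pulls every micro-batch to the driver, so this only suits
    // streams small enough to buffer there.
    import scala.collection.mutable.ArrayBuffer
    import org.apache.spark.streaming.dstream.DStream

    def countWindows(stream: DStream[Int], size: Int, slide: Int)
                    (handle: Seq[Int] => Unit): Unit = {
      val buf = ArrayBuffer.empty[Int]   // driver-side element buffer
      stream.foreachRDD { rdd =>
        buf ++= rdd.collect()            // append this micro-batch's elements
        while (buf.length >= size) {     // emit each complete window...
          handle(buf.take(size).toList)
          buf.remove(0, slide)           // ...then advance by the slide count
        }
      }
    }

With the example at the top of the thread, countWindows(stream, 3, 1)(println) would print List(1, 2, 3), List(2, 3, 4), and so on.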

