Posted to user@storm.apache.org by Jonathan Poon <jk...@ucdavis.edu> on 2014/06/07 00:01:29 UTC

Time Partitioning of Tuples

Hi Everyone,

I'm currently investigating different data processing tools for an
application I'm interested in.  I have many sensors that I collect data
from.  However, I would like to group the data from every sensor at
predefined time intervals and process it together.

Using Storm terminology, I would have each sensor send data to a spout.
The spouts would then send tuples to a specific bolt that will process all
of the data within a specific time partition.  Each spout will tag each
event with a time id and each bolt will process data after collecting all
of the data with the same time id tags.

Is this possible with Storm?

I appreciate your help!

Jonathan
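
A minimal sketch of the scheme described above, written as plain Python rather than against the Storm API. The interval length, the event layout (sensor id, timestamp, value), and the names time_id, route and partition are assumptions made for illustration. route only imitates the effect of keying on the time id, so that every event with the same id lands on the same bolt instance.

```python
from collections import defaultdict

INTERVAL_SECONDS = 10  # hypothetical length of one time partition


def time_id(timestamp, interval=INTERVAL_SECONDS):
    """Map an event timestamp (in seconds) to the index of its time partition."""
    return int(timestamp // interval)


def route(tid, num_bolts):
    """Pick one bolt instance per time id, so equal ids always go to the same place."""
    return tid % num_bolts


def partition(events, num_bolts, interval=INTERVAL_SECONDS):
    """Group (sensor_id, timestamp, value) events by bolt instance, then by time id."""
    inbox = defaultdict(lambda: defaultdict(list))
    for sensor_id, timestamp, value in events:
        tid = time_id(timestamp, interval)
        inbox[route(tid, num_bolts)][tid].append((sensor_id, value))
    return inbox
```

With two bolt instances and 10-second partitions, events stamped 0.5 s and 9.9 s share time id 0 and reach the same instance, while an event stamped 10.0 s gets time id 1 and goes to the other one.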

Re: Time Partitioning of Tuples

Posted by Jonathan Poon <jk...@ucdavis.edu>.
Hi Kyle,

I have a scientific application with thousands of sensors emitting
approximately 500,000 events per second.  I'm looking for a tool that can
process these events in real time.  In my application, I need to
read events from each of the sensors and compute correlations between the
sensors based on the data.  As you described, I pretty much need a
real-time batch processing tool.

For my application, the computation for each time slice should take much
longer than the time slice itself.  I need a highly scalable tool that can
read all of the data and process the data from each time slice in parallel.

Jonathan



Re: Time Partitioning of Tuples

Posted by Jonathan Poon <jk...@ucdavis.edu>.
I will take a look into Trident as well.  Thanks for the tip!



Re: Time Partitioning of Tuples

Posted by Kyle Nusbaum <kn...@yahoo-inc.com>.
Sounds interesting.

I don't know much about your project, so I won't speculate about your 
purposes.

One thing to consider is that the duration of the computation on a time 
slice must be longer than the time slice itself to really make this type 
of setup worthwhile. Otherwise you could just feed the batches through 
the same bolt, since it would be done processing a batch before the next 
one comes in.

-- Kyle
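
Kyle's point can be put as rough arithmetic. If one time slice takes compute_seconds to process and a new slice arrives every slice_seconds, then roughly ceil(compute_seconds / slice_seconds) slices are being processed at any moment, and that many bolt instances are needed to keep up. The function name and the figures below are hypothetical, and the estimate ignores scheduling and network overhead.

```python
import math


def min_parallel_workers(compute_seconds, slice_seconds):
    """Estimate how many slices are in flight at once, i.e. how many workers are needed."""
    if compute_seconds <= 0 or slice_seconds <= 0:
        raise ValueError("durations must be positive")
    return max(1, math.ceil(compute_seconds / slice_seconds))
```

For example, min_parallel_workers(0.5, 1.0) gives 1, which is the case where a single bolt finishes each batch before the next arrives, whereas min_parallel_workers(2.5, 1.0) gives 3.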



Re: Time Partitioning of Tuples

Posted by Shahab Yunus <sh...@gmail.com>.
I might be off-base here (experts can chime in), but have you looked at
Trident, which is an API built on top of Storm? It provides the concept of
aggregation to simulate micro-batches or, in other words, to maintain state.
You can even persist this partially aggregated state. If Spark's in-memory
M/R is not fast enough for you, then you could take a look at that, perhaps.

Regards,
Shahab



RE: Time Partitioning of Tuples

Posted by Lesego Peter <l....@newcastle.ac.uk>.
Thanks, that has really helped.

-Lesego

RE: Time Partitioning of Tuples

Posted by Dan <dc...@hotmail.com>.
I've not used it, but you might look at:
https://github.com/buildlackey/cep/tree/master/esper%2Bstorm%2Bkafka
-Dan


RE: Time Partitioning of Tuples

Posted by Lesego Peter <l....@newcastle.ac.uk>.
Hi,
Is there anybody who has already embedded Esper into Storm?

-Lesego

RE: Time Partitioning of Tuples

Posted by Dan <dc...@hotmail.com>.
You might look at Esper. I believe someone has even embedded Esper into Storm.
-Dan


Re: Time Partitioning of Tuples

Posted by Jonathan Poon <jk...@ucdavis.edu>.
Hi Kyle,

I'm looking for a real-time batch processing tool.  In my case, I'm looking
to make correlations between all of the sensors at each time interval.

I could use Hadoop (MapReduce), but it requires me to collect all of
the data before I can batch process each time partition of data from each
sensor.

Another tool I'm looking at is Spark Streaming, which allows me to
collect data at different time intervals and process that batch of data
using MapReduce.

However, MapReduce seems inefficient because my sensor data is already
naturally sorted by time.  In addition, I would like real-time data on the fly.

Storm seems like it might be a candidate for this application.  Please let me
know what you think!  Thanks for your help!

Jonathan





Re: Time Partitioning of Tuples

Posted by Kyle Nusbaum <kn...@yahoo-inc.com>.
You could send a signal tuple from the spout when it knows it's sent the 
last tuple for a time period, or include a field in the tuple for 
indicating it's the last member.
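
A minimal sketch of the bolt-side bookkeeping this implies, written as
plain Java with no Storm dependency. The class and method names
(TimeSliceBuffer, add, signal) are made up for illustration, and it
assumes each spout sends exactly one signal per time id after its last
data tuple for that id:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: buffers readings per time id until every spout has signalled. */
class TimeSliceBuffer {
    private final int expectedSignals;  // assumed: one signal per spout per time id
    private final Map<Long, List<Double>> readings = new HashMap<>();
    private final Map<Long, Integer> signalsSeen = new HashMap<>();

    TimeSliceBuffer(int expectedSignals) {
        this.expectedSignals = expectedSignals;
    }

    /** Store one sensor reading under its time id. */
    void add(long timeId, double reading) {
        readings.computeIfAbsent(timeId, k -> new ArrayList<>()).add(reading);
    }

    /**
     * Record one end-of-slice signal. Returns the buffered slice once all
     * expected signals for this time id have arrived, otherwise empty.
     */
    Optional<List<Double>> signal(long timeId) {
        int seen = signalsSeen.merge(timeId, 1, Integer::sum);
        if (seen < expectedSignals) {
            return Optional.empty();
        }
        signalsSeen.remove(timeId);
        List<Double> slice = readings.remove(timeId);
        return Optional.of(slice != null ? slice : new ArrayList<>());
    }
}
```

Inside a bolt, execute() would call add() for data tuples and signal()
for signal tuples, and run the correlation only when signal() returns a
slice. The sketch relies on a signal not overtaking the data it closes.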

I'm curious about why you want to do this, since the purpose of Storm is
to facilitate stream processing rather than the type of batch processing
you're describing.

-- Kyle

On 06/06/2014 05:14 PM, Jonathan Poon wrote:
> Hi Nathan,
>
> The sensor data I have is naturally time sorted, since it's just
> collecting data and emitting it to a spout. Is it possible for a bolt 
> to know when all of the tuples with the same time tag have been 
> collected and to start processing it together?  Or is it only possible 
> for a bolt to process each tuple one at a time?
>
> Thanks!
>
>
>
> On Fri, Jun 6, 2014 at 3:07 PM, Nathan Leung <ncleung@gmail.com> wrote:
>
>     You can have your bolt subscribe to the spout using fields
>     grouping and use time tag as your key.
>
>     On Jun 6, 2014 6:01 PM, "Jonathan Poon" <jkpoon@ucdavis.edu> wrote:
>
>         Hi Everyone,
>
>         I'm currently investigating different data processing tools
>         for an application I'm interested in.  I have many sensors
>         that I collect data from.  However, I would like to group the
>         data from every sensor at predefined time intervals and
>         process it together.
>
>         Using Storm terminology, I would have each sensor send data to
>         a spout.  The spouts would then send tuples to a specific bolt
>         that will process all of the data within a specific time
>         partition.  Each spout will tag each event with a time id and
>         each bolt will process data after collecting all of the data
>         with the same time id tags.
>
>         Is this possible with Storm?
>
>         I appreciate your help!
>
>         Jonathan
>
>


Re: Time Partitioning of Tuples

Posted by Jonathan Poon <jk...@ucdavis.edu>.
Hi Nathan,

The sensor data I have is naturally time sorted, since it's just collecting
data and emitting it to a spout. Is it possible for a bolt to know when all
of the tuples with the same time tag have been collected and to start
processing it together?  Or is it only possible for a bolt to process each
tuple one at a time?

Thanks!



On Fri, Jun 6, 2014 at 3:07 PM, Nathan Leung <nc...@gmail.com> wrote:

> You can have your bolt subscribe to the spout using fields grouping and
> use time tag as your key.
> On Jun 6, 2014 6:01 PM, "Jonathan Poon" <jk...@ucdavis.edu> wrote:
>
>> Hi Everyone,
>>
>> I'm currently investigating different data processing tools for an
>> application I'm interested in.  I have many sensors that I collect data
>> from.  However, I would like to group the data from every sensor at
>> predefined time intervals and process it together.
>>
>> Using Storm terminology, I would have each sensor send data to a spout.
>> The spouts would then send tuples to a specific bolt that will process all
>> of the data within a specific time partition.  Each spout will tag each
>> event with a time id and each bolt will process data after collecting all
>> of the data with the same time id tags.
>>
>> Is this possible with Storm?
>>
>> I appreciate your help!
>>
>> Jonathan
>>
>

Re: Time Partitioning of Tuples

Posted by Nathan Leung <nc...@gmail.com>.
You can have your bolt subscribe to the spout using fields grouping and use
time tag as your key.
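
In a topology this would be wired up with something like
builder.setBolt("correlate", new CorrelateBolt(), 4)
       .fieldsGrouping("sensor-spout", new Fields("timeId"));
where the component names, the bolt class, and the field name are
placeholders. The effect can be sketched in plain Java as below. This is
a simplified model of the idea (hash the key, take it modulo the number
of tasks), not Storm's actual implementation:

```java
/** Hypothetical model of key-based routing: the same key always picks the same task. */
class FieldsGroupingModel {
    /** Map a time id to one of numTasks bolt tasks (index 0 .. numTasks-1). */
    static int taskFor(long timeId, int numTasks) {
        // floorMod keeps the index non-negative even if the hash is negative
        return Math.floorMod(Long.hashCode(timeId), numTasks);
    }
}
```

Because the time id is the only key, every tuple of a given time slice
lands on a single task. Different slices can be processed in parallel,
but one slice is not spread across tasks.
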
On Jun 6, 2014 6:01 PM, "Jonathan Poon" <jk...@ucdavis.edu> wrote:

> Hi Everyone,
>
> I'm currently investigating different data processing tools for an
> application I'm interested in.  I have many sensors that I collect data
> from.  However, I would like to group the data from every sensor at
> predefined time intervals and process it together.
>
> Using Storm terminology, I would have each sensor send data to a spout.
> The spouts would then send tuples to a specific bolt that will process all
> of the data within a specific time partition.  Each spout will tag each
> event with a time id and each bolt will process data after collecting all
> of the data with the same time id tags.
>
> Is this possible with Storm?
>
> I appreciate your help!
>
> Jonathan
>