Posted to user@storm.apache.org by Nipun Batra <ba...@gmail.com> on 2014/10/14 17:52:50 UTC

Batch ID TxId

Hi

I have a never-ending data feed and I want to define batches on an hourly
basis, i.e. set one batch ID for all the tuples that arrive in a particular
hour. If I write my own custom spout, how do I set the batch ID / TxId?
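
One way to read "batch by hour" is to derive an hour bucket from each tuple's timestamp and carry it as an ordinary field, rather than trying to control Storm's own transaction ID (in Trident, as far as I know, the framework assigns the txid per batch and the spout only receives it). The sketch below is plain Java with no Storm dependency; the class `HourlyBatchId` and its methods are hypothetical names used only for illustration, and it assumes UTC hour boundaries.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Storm: maps an event timestamp to an hourly bucket.
final class HourlyBatchId {
    private static final long MILLIS_PER_HOUR = 3_600_000L;
    private static final DateTimeFormatter LABEL =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH").withZone(ZoneOffset.UTC);

    private HourlyBatchId() {}

    // Whole hours since the Unix epoch (UTC); every event in the same hour gets the same value.
    static long of(long epochMillis) {
        return Math.floorDiv(epochMillis, MILLIS_PER_HOUR);
    }

    // Human-readable form of a bucket, usable for example as a storage key.
    static String label(long batchId) {
        return LABEL.format(Instant.ofEpochMilli(batchId * MILLIS_PER_HOUR));
    }

    public static void main(String[] args) {
        long ts = Instant.parse("2014-10-14T17:52:50Z").toEpochMilli();
        long id = of(ts);
        System.out.println(id + " -> " + label(id));
    }
}
```

A spout or bolt could emit this value as an extra tuple field, so that downstream grouping and persistence can key on the hour without changing how Storm tracks tuples.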

Later the data feed will be consumed from a Kafka topic. If I plan to use the
Kafka spout, is there again a way to batch, or to set the TxId, by hour?

I have looked at many examples but have not been able to find this. I would
appreciate it if you could point me in the right direction, or to any example
of a custom spout setting the batch ID.

I apologize if this has already been asked; I looked around but found
nothing.

Thank you in advance
Nipun

Re: Batch ID TxId

Posted by Itai Frenkel <It...@forter.com>.
>> If for any reason the machine holding the hourly aggregated data goes down, I want the missing hour's tuples to be replayed from my queue.

An hour is too long, since the Storm spout would time out well before that. Even though the timeout is configurable, I do not think that would be the right way of doing it.



Many NoSQL products that can perform aggregations come to mind (Couchbase, Elasticsearch, and most other K/V-type NoSQL stores). You would put Storm in front of a NoSQL store to reduce the data throughput (event consolidation), to perform extremely non-standard aggregations (ones not covered by a simple map-reduce script), or if you must have real-time stats. Since you said your stats are not real-time, this leaves us with the following questions:

1. What is your raw event throughput?

2. What type of aggregations are you trying to perform?



Regards,

Itai


________________________________
From: Nipun Batra <ba...@gmail.com>
Sent: Wednesday, October 15, 2014 9:28 AM
To: user@storm.apache.org
Subject: Re: Batch ID TxId

Hi Yuval

Thanks for responding. Here is what I have in mind: I was thinking of aggregating the data on an hourly basis in memory and persisting it every hour. If for any reason the machine holding the hourly aggregated data goes down, I want the missing hour's tuples to be replayed from my queue. Any suggestions?

Regards
Nipun



On Tue, Oct 14, 2014 at 4:33 PM, Yuval Oren <yu...@n3twork.com> wrote:
Nipun,

That seems to be contrary to the typical Storm pattern of continuous processing. Is there a reason you can't continuously read new data? That might also scale better.

--
Yuval Oren
N3TWORK


Re: Batch ID TxId

Posted by Nipun Batra <ba...@gmail.com>.
Hi Yuval

Thanks for responding. Here is what I have in mind: I was thinking of
aggregating the data on an hourly basis in memory and persisting it every
hour. If for any reason the machine holding the hourly aggregated data goes
down, I want the missing hour's tuples to be replayed from my queue. Any
suggestions?
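
The in-memory hourly aggregation described above could look roughly like the sketch below. It is plain Java with no Storm dependency; `HourlyAggregator`, `add`, and `drainCompleted` are hypothetical names, the aggregation is assumed to be a simple per-key count, and hour boundaries are assumed to be UTC.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not a Storm API: counts events per key inside hourly buckets.
final class HourlyAggregator {
    private static final long MILLIS_PER_HOUR = 3_600_000L;

    // hour bucket -> (key -> count); a TreeMap keeps the buckets in time order.
    private final TreeMap<Long, Map<String, Long>> buckets = new TreeMap<>();

    // Adds one event for the given key to the bucket of the hour it occurred in.
    void add(String key, long epochMillis) {
        long hour = Math.floorDiv(epochMillis, MILLIS_PER_HOUR);
        buckets.computeIfAbsent(hour, h -> new HashMap<>()).merge(key, 1L, Long::sum);
    }

    // Removes and returns every bucket that ended before the hour containing nowMillis.
    // The caller is expected to persist the returned buckets.
    TreeMap<Long, Map<String, Long>> drainCompleted(long nowMillis) {
        long currentHour = Math.floorDiv(nowMillis, MILLIS_PER_HOUR);
        Map<Long, Map<String, Long>> done = buckets.headMap(currentHour, false);
        TreeMap<Long, Map<String, Long>> out = new TreeMap<>(done);
        done.clear(); // clearing the view also removes the entries from 'buckets'
        return out;
    }
}
```

Note that this sketch does nothing about recovery: if the process dies, the counts for the hour still held in memory are lost, which is exactly the case the replay question is about.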

Regards
Nipun



On Tue, Oct 14, 2014 at 4:33 PM, Yuval Oren <yu...@n3twork.com> wrote:

> Nipun,
>
> That seems to be contrary to the typical Storm pattern of continuous
> processing. Is there a reason you can’t continuously read new data? That
> might also scale better.
>
> --
> Yuval Oren
> *N3TWORK*

Re: Batch ID TxId

Posted by Yuval Oren <yu...@n3twork.com>.
Nipun,

That seems to be contrary to the typical Storm pattern of continuous processing. Is there a reason you can’t continuously read new data? That might also scale better.

--
Yuval Oren
N3TWORK
