You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@storm.apache.org by "Kalogeropoulos, Andreas" <An...@emc.com> on 2015/11/30 17:42:24 UTC

Using Storm to parse emails and creates batches

Hello,

I want to use Storm to do three things :

1.       Parse emails data (from/ to / cc/ subject ) from incoming SMTP source

2.       Add additional information (based on sender email)

3.       Create an XML based on this data, to inject in another solution

Only issue, I want step 1 (and 2) to be as fast as possible so creating the maximum bolts/tasks possible,
But I want the XML to be as big as possible so gathering information for multiple output of bolts.

In this logic, I fi have 100 mails per second in original input, I would want to have step1 and step 2 to work on the smallest number of emails to do it faster.
But I still want to be able to have an XML that represent 10 000+ emails at the end.

I can't think of topology to address this.
Can someone give me some pointers to the best way to handle this ?


Kind Regards,
Andréas Kalogéropoulos

RE: Using Storm to parse emails and creates batches

Posted by "Kalogeropoulos, Andreas" <An...@emc.com>.

Hello Nick,

I think you are right. It is probably the state that I am not taking into consideration in my logic.
And it is probably only in the last step.

The first is just “extract”, so as you say, I need a “filtering” bolt to just take out what I need
The second is probably going to read from a Cassandra database, and add elements to my tuples, based on the keys coming from y tuples.

It is the third one that needs a state, because I want to wait from multiple outputs coming from previous bolt to work. If second step is working with a list of 10 tuples, I would need to wait for (example) 100 of them, to create and XML with 1000 items (100 X 10 tuples).
Hence it is indeed the “in a logical window of operation (either purely time-based or tuple-based).” That I need to implement. Thanks for the pointers.
If you have any “watch out” on best practices or limitation or example code, it is more than welcomed, but at least you got me going in the direction I need.
Thanks.


Kind Regards,
Andréas Kalogéropoulos

From: Nick R. Katsipoulakis [mailto:nick.katsip@gmail.com]
Sent: Monday, November 30, 2015 5:55 PM
To: user@storm.apache.org
Subject: Re: Using Storm to parse emails and creates batches

Hello Andrea,

Please check my inline answers below.  However, I think its not the topology that is puzzling you (since you already defined the workflow in steps), rather, the semantics of data involved. To be more precise, you seem to need some state maintained on different bolts. You have to define how often the state is updated, where it is stored, whether it is window-based or is historically accumulated etc. Also, if you manage to have your operators work in a stete-less way (apply functions on each input tuple), then the challenging part would be to mitigate any I/O (i.e. contact an external storage) and the processing cost. I hope that you will find my email useful.


On Mon, Nov 30, 2015 at 11:42 AM, Kalogeropoulos, Andreas <An...@emc.com>> wrote:
Hello,

I want to use Storm to do three things :

1.       Parse emails data (from/ to / cc/ subject ) from incoming SMTP source
For this part, you have to consider the semantics of your processing. For instance, does the processing involve any state maintenance? If not, it is simply a "filtering" bolt, so you can be really flexible on its performance. In fact, you can start with an initial parallelism hint (number of threads executing the filtering mechanism) and then either scale-up/down according to the actual performance during runtime (capacity reached by those bolts)

2.       Add additional information (based on sender email)
This part looks like its going to perform an I/O in order to get more information (right?). If yes, you need to consider different engineering ways on how you can retrieve these data. If not, and you get additional information from the actual mail, then again you can apply the same idea as in Step 1.

3.       Create an XML based on this data, to inject in another solution
This part is tricky because it is not clear to me whether those XMLs contain aggregated information, or they are build separately based on the input that each bolt receives. In the former case, you will need to engineer your desired aggregate operations based on your application semantics. In the latter, each bolt can produce its XML based on the input it received in a logical window of operation (either purely time-based or tuple-based).

Only issue, I want step 1 (and 2) to be as fast as possible so creating the maximum bolts/tasks possible,
But I want the XML to be as big as possible so gathering information for multiple output of bolts.

In this logic, I fi have 100 mails per second in original input, I would want to have step1 and step 2 to work on the smallest number of emails to do it faster.
But I still want to be able to have an XML that represent 10 000+ emails at the end.

I can’t think of topology to address this.
Can someone give me some pointers to the best way to handle this ?


Kind Regards,
Andréas Kalogéropoulos




--
Nick R. Katsipoulakis,
Department of Computer Science
University of Pittsburgh

Re: Using Storm to parse emails and creates batches

Posted by "Nick R. Katsipoulakis" <ni...@gmail.com>.

Hello Andrea,

Please check my inline answers below.  However, I think its not the
topology that is puzzling you (since you already defined the workflow in
steps), rather, the semantics of data involved. To be more precise, you
seem to need some state maintained on different bolts. You have to define
how often the state is updated, where it is stored, whether it is
window-based or is historically accumulated etc. Also, if you manage to
have your operators work in a stete-less way (apply functions on each input
tuple), then the challenging part would be to mitigate any I/O (i.e.
contact an external storage) and the processing cost. I hope that you will
find my email useful.


On Mon, Nov 30, 2015 at 11:42 AM, Kalogeropoulos, Andreas <
Andreas.Kalogeropoulos@emc.com> wrote:

> Hello,
>
>
>
> I want to use Storm to do three things :
>
> 1.       Parse emails data (from/ to / cc/ subject ) from incoming SMTP
> source
>
For this part, you have to consider the semantics of your processing. For
instance, does the processing involve any state maintenance? If not, it is
simply a "filtering" bolt, so you can be really flexible on its
performance. In fact, you can start with an initial parallelism hint
(number of threads executing the filtering mechanism) and then either
scale-up/down according to the actual performance during runtime (capacity
reached by those bolts)

> 2.       Add additional information (based on sender email)
>
This part looks like its going to perform an I/O in order to get more
information (right?). If yes, you need to consider different engineering
ways on how you can retrieve these data. If not, and you get additional
information from the actual mail, then again you can apply the same idea as
in Step 1.

> 3.       Create an XML based on this data, to inject in another solution
>
This part is tricky because it is not clear to me whether those XMLs
contain aggregated information, or they are build separately based on the
input that each bolt receives. In the former case, you will need to
engineer your desired aggregate operations based on your application
semantics. In the latter, each bolt can produce its XML based on the input
it received in a logical window of operation (either purely time-based or
tuple-based).

>
>
> Only issue, I want step 1 (and 2) to be as fast as possible so creating
> the maximum bolts/tasks possible,
>
> But I want the XML to be as big as possible so gathering information for
> multiple output of bolts.
>
>
>
> In this logic, I fi have 100 mails per second in original input, I would
> want to have step1 and step 2 to work on the smallest number of emails to
> do it faster.
>
> But I still want to be able to have an XML that represent 10 000+ emails
> at the end.
>
>
>
> I can’t think of topology to address this.
>
> Can someone give me some pointers to the best way to handle this ?
>
>
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>



-- 
Nick R. Katsipoulakis,
Department of Computer Science
University of Pittsburgh

RE: Using Storm to parse emails and creates batches

Posted by "Kalogeropoulos, Andreas" <An...@emc.com>.

Hello Stephen,

To make my example more realistic :
The first bolt will analyze a list of 100 tuples
And the last bolt, will probably wait for  10 000 list of tuples before creating the XML.

Kind Regards,
Andréas Kalogéropoulos

From: Kalogeropoulos, Andreas [mailto:Andreas.Kalogeropoulos@emc.com]
Sent: Tuesday, December 01, 2015 11:48 AM
To: user@storm.apache.org
Subject: RE: Using Storm to parse emails and creates batches

Hello Stephen,

Imagine that the spout is providing me 300 000 emails per hour.
The first bolt will parse/analyze the information (from, to, cc, subject, object, date, has of attachments, …  , and probably will find the same hash for some attachments (someone forwarding an email).

The last bolt will create an XML based on all this information, but if I can have the tuples containing the same attachment (based on hash) in the same XML, I can actually apply a dedup logic : having multiple lines in my xml pointing to the same file

Does this make more sense ?

Kind Regards,
Andréas Kalogéropoulos

From: Stephen Powis [mailto:spowis@salesforce.com]
Sent: Tuesday, December 01, 2015 11:36 AM
To: user@storm.apache.org<ma...@storm.apache.org>
Subject: Re: Using Storm to parse emails and creates batches

I'm not sure I follow/understand your question or what you're trying to do.

On Tue, Dec 1, 2015 at 7:28 PM, Kalogeropoulos, Andreas <An...@emc.com>> wrote:
You are right. Sorry for making you state the obvious ☺.

Last question : If my spout has incoming information that I want to have in the same last bolt (the one creating the XML) for deduplication logic, what is the best way to achieve this ?  My instinct says to try to work with Fields grouping and the correct key (probably conversation since I am working with emails).

Kind Regards,
Andréas Kalogéropoulos

From: Stephen Powis [mailto:spowis@salesforce.com<ma...@salesforce.com>]
Sent: Tuesday, December 01, 2015 10:27 AM

To: user@storm.apache.org<ma...@storm.apache.org>
Subject: Re: Using Storm to parse emails and creates batches

If you are using Storm's guaranteed message processing<http://storm.apache.org/documentation/Guaranteeing-message-processing.html> there is no need to 'persist' the collection anywhere other than in memory.  IE List<Tuple> myListOfTuples = new ArrayList<Tuple>();  If the third bolt crashes and loses its in memory collection, after Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS<http://storm.apache.org/javadoc/apidocs/backtype/storm/Config.html#TOPOLOGY_MESSAGE_TIMEOUT_SECS> the tuples will timeout and be replayed thru your entire storm topology and your collection will be repopulated

On Tue, Dec 1, 2015 at 5:15 PM, Kalogeropoulos, Andreas <An...@emc.com>> wrote:
Hello Stephen,

I think you got I correctly. Thanks a lot for the idea.
If you have seen limitations, please send the disclaimers ☺ . For example, how did you handle persistence of this collection ? If the third bolt failed while populating the collection (size and time has not been reached) we just lost everything, so I need to have a status loopback of what was really output. Right ?

Of course, if you can send me the code of your third bolt (especially the collection handling), I’ll be grateful.
In all cases, thanks a lot for your help, even without the code, you really give me example advice, and now I can start building something.

Kind Regards,
Andréas Kalogéropoulos

From: Stephen Powis [mailto:spowis@salesforce.com<ma...@salesforce.com>]
Sent: Monday, November 30, 2015 5:55 PM
To: user@storm.apache.org<ma...@storm.apache.org>
Subject: Re: Using Storm to parse emails and creates batches

From what I understand from your description, you want bolt 3 to collect results from multiple tuples and build a single xml for them.  We've done this by essentially doing the following:

Bolt 3 has a collection of tuples.  As a tuple comes in, we add it to the collection and check the size of the collection.  Once the size of the collection exceeds some number, we then process all of the tuples in one go, and then ACK all of them after the processing completes.

Building on that, we've implemented an additional constraint on time.  If the collection size > N OR if we've waited more than X seconds, process the batch.  This way your output won't stall out if your topology has a lull in data being ingested.
And then lastly, there's a corner case where say 10 tuples come in and get held by our collection but then no other tuples come in for a long period of time.  If no tuples enter, that means the size and timeout checks are never executed and your bolt will hold onto those tuples for a long time (potentially causing timeouts).  To handle this, we made use of tick tuples.  Tick tuples essentially allow you to you to send a special tuple to your bolt every Y seconds.  We use that to trigger checking the time constraint is checked on a regular basis (example being send a tick tuple every 1, 5, or 10 seconds)

On Tue, Dec 1, 2015 at 1:42 AM, Kalogeropoulos, Andreas <An...@emc.com>> wrote:
Hello,

I want to use Storm to do three things :

1.       Parse emails data (from/ to / cc/ subject ) from incoming SMTP source

2.       Add additional information (based on sender email)

3.       Create an XML based on this data, to inject in another solution

Only issue, I want step 1 (and 2) to be as fast as possible so creating the maximum bolts/tasks possible,
But I want the XML to be as big as possible so gathering information for multiple output of bolts.

In this logic, I fi have 100 mails per second in original input, I would want to have step1 and step 2 to work on the smallest number of emails to do it faster.
But I still want to be able to have an XML that represent 10 000+ emails at the end.

I can’t think of topology to address this.
Can someone give me some pointers to the best way to handle this ?


Kind Regards,
Andréas Kalogéropoulos

RE: Using Storm to parse emails and creates batches

Posted by "Kalogeropoulos, Andreas" <An...@emc.com>.

Thanks Stephen. Really appreciated.

Kind Regards,
Andréas Kalogéropoulos

From: Stephen Powis [mailto:spowis@salesforce.com]
Sent: Tuesday, December 01, 2015 12:45 PM
To: user@storm.apache.org
Subject: Re: Using Storm to parse emails and creates batches

Yep, sounds like you got it......you'd want to use field grouping and group on a field that contains the hash.  Then every tuple that has that field with the identical hashes would get sent to the same bolt instance.

On Tue, Dec 1, 2015 at 8:23 PM, Kalogeropoulos, Andreas <An...@emc.com>> wrote:
Making sure that duplicates make it in the same XML file (third bolt).

Kind Regards,
Andréas Kalogéropoulos

From: Stephen Powis [mailto:spowis@salesforce.com<ma...@salesforce.com>]
Sent: Tuesday, December 01, 2015 11:59 AM

To: user@storm.apache.org<ma...@storm.apache.org>
Subject: Re: Using Storm to parse emails and creates batches

So you want to eliminate duplicates or make sure that duplicates make it into the same XML file (third bolt)?

On Tue, Dec 1, 2015 at 7:48 PM, Kalogeropoulos, Andreas <An...@emc.com>> wrote:
Hello Stephen,

Imagine that the spout is providing me 300 000 emails per hour.
The first bolt will parse/analyze the information (from, to, cc, subject, object, date, has of attachments, …  , and probably will find the same hash for some attachments (someone forwarding an email).

The last bolt will create an XML based on all this information, but if I can have the tuples containing the same attachment (based on hash) in the same XML, I can actually apply a dedup logic : having multiple lines in my xml pointing to the same file

Does this make more sense ?

Kind Regards,
Andréas Kalogéropoulos

From: Stephen Powis [mailto:spowis@salesforce.com<ma...@salesforce.com>]
Sent: Tuesday, December 01, 2015 11:36 AM

To: user@storm.apache.org<ma...@storm.apache.org>
Subject: Re: Using Storm to parse emails and creates batches

I'm not sure I follow/understand your question or what you're trying to do.

On Tue, Dec 1, 2015 at 7:28 PM, Kalogeropoulos, Andreas <An...@emc.com>> wrote:
You are right. Sorry for making you state the obvious ☺.

Last question : If my spout has incoming information that I want to have in the same last bolt (the one creating the XML) for deduplication logic, what is the best way to achieve this ?  My instinct says to try to work with Fields grouping and the correct key (probably conversation since I am working with emails).

Kind Regards,
Andréas Kalogéropoulos

From: Stephen Powis [mailto:spowis@salesforce.com<ma...@salesforce.com>]
Sent: Tuesday, December 01, 2015 10:27 AM

To: user@storm.apache.org<ma...@storm.apache.org>
Subject: Re: Using Storm to parse emails and creates batches

If you are using Storm's guaranteed message processing<http://storm.apache.org/documentation/Guaranteeing-message-processing.html> there is no need to 'persist' the collection anywhere other than in memory.  IE List<Tuple> myListOfTuples = new ArrayList<Tuple>();  If the third bolt crashes and loses its in memory collection, after Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS<http://storm.apache.org/javadoc/apidocs/backtype/storm/Config.html#TOPOLOGY_MESSAGE_TIMEOUT_SECS> the tuples will timeout and be replayed thru your entire storm topology and your collection will be repopulated

On Tue, Dec 1, 2015 at 5:15 PM, Kalogeropoulos, Andreas <An...@emc.com>> wrote:
Hello Stephen,

I think you got I correctly. Thanks a lot for the idea.
If you have seen limitations, please send the disclaimers ☺ . For example, how did you handle persistence of this collection ? If the third bolt failed while populating the collection (size and time has not been reached) we just lost everything, so I need to have a status loopback of what was really output. Right ?

Of course, if you can send me the code of your third bolt (especially the collection handling), I’ll be grateful.
In all cases, thanks a lot for your help, even without the code, you really give me example advice, and now I can start building something.

Kind Regards,
Andréas Kalogéropoulos

From: Stephen Powis [mailto:spowis@salesforce.com<ma...@salesforce.com>]
Sent: Monday, November 30, 2015 5:55 PM
To: user@storm.apache.org<ma...@storm.apache.org>
Subject: Re: Using Storm to parse emails and creates batches

From what I understand from your description, you want bolt 3 to collect results from multiple tuples and build a single xml for them.  We've done this by essentially doing the following:

Bolt 3 has a collection of tuples.  As a tuple comes in, we add it to the collection and check the size of the collection.  Once the size of the collection exceeds some number, we then process all of the tuples in one go, and then ACK all of them after the processing completes.

Building on that, we've implemented an additional constraint on time.  If the collection size > N OR if we've waited more than X seconds, process the batch.  This way your output won't stall out if your topology has a lull in data being ingested.
And then lastly, there's a corner case where say 10 tuples come in and get held by our collection but then no other tuples come in for a long period of time.  If no tuples enter, that means the size and timeout checks are never executed and your bolt will hold onto those tuples for a long time (potentially causing timeouts).  To handle this, we made use of tick tuples.  Tick tuples essentially allow you to you to send a special tuple to your bolt every Y seconds.  We use that to trigger checking the time constraint is checked on a regular basis (example being send a tick tuple every 1, 5, or 10 seconds)

On Tue, Dec 1, 2015 at 1:42 AM, Kalogeropoulos, Andreas <An...@emc.com>> wrote:
Hello,

I want to use Storm to do three things :

1.       Parse emails data (from/ to / cc/ subject ) from incoming SMTP source

2.       Add additional information (based on sender email)

3.       Create an XML based on this data, to inject in another solution

Only issue, I want step 1 (and 2) to be as fast as possible so creating the maximum bolts/tasks possible,
But I want the XML to be as big as possible so gathering information for multiple output of bolts.

In this logic, I fi have 100 mails per second in original input, I would want to have step1 and step 2 to work on the smallest number of emails to do it faster.
But I still want to be able to have an XML that represent 10 000+ emails at the end.

I can’t think of topology to address this.
Can someone give me some pointers to the best way to handle this ?

Kind Regards,
Andréas Kalogéropoulos

Re: Using Storm to parse emails and creates batches

Posted by Stephen Powis <sp...@salesforce.com>.

Yep, sounds like you got it......you'd want to use field grouping and group
on a field that contains the hash.  Then every tuple that has that field
with the identical hashes would get sent to the same bolt instance.

On Tue, Dec 1, 2015 at 8:23 PM, Kalogeropoulos, Andreas <
Andreas.Kalogeropoulos@emc.com> wrote:

> Making sure that duplicates make it in the same XML file (third bolt).
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
> *From:* Stephen Powis [mailto:spowis@salesforce.com]
> *Sent:* Tuesday, December 01, 2015 11:59 AM
>
> *To:* user@storm.apache.org
> *Subject:* Re: Using Storm to parse emails and creates batches
>
>
>
> So you want to eliminate duplicates or make sure that duplicates make it
> into the same XML file (third bolt)?
>
>
>
> On Tue, Dec 1, 2015 at 7:48 PM, Kalogeropoulos, Andreas <
> Andreas.Kalogeropoulos@emc.com> wrote:
>
> Hello Stephen,
>
>
>
> Imagine that the spout is providing me 300 000 emails per hour.
>
> The first bolt will parse/analyze the information (from, to, cc, subject,
> object, date, has of attachments, …  , and probably will find the same hash
> for some attachments (someone forwarding an email).
>
>
>
> The last bolt will create an XML based on all this information, but if I
> can have the tuples containing the same attachment (based on hash) in the
> same XML, I can actually apply a dedup logic : having multiple lines in my
> xml pointing to the same file
>
>
>
> Does this make more sense ?
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
> *From:* Stephen Powis [mailto:spowis@salesforce.com]
> *Sent:* Tuesday, December 01, 2015 11:36 AM
>
>
> *To:* user@storm.apache.org
> *Subject:* Re: Using Storm to parse emails and creates batches
>
>
>
> I'm not sure I follow/understand your question or what you're trying to do.
>
>
>
> On Tue, Dec 1, 2015 at 7:28 PM, Kalogeropoulos, Andreas <
> Andreas.Kalogeropoulos@emc.com> wrote:
>
> You are right. Sorry for making you state the obvious J.
>
>
>
> Last question : If my spout has incoming information that I want to have
> in the same last bolt (the one creating the XML) for deduplication logic,
> what is the best way to achieve this ?  My instinct says to try to work
> with Fields grouping and the correct key (probably conversation since I am
> working with emails).
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
> *From:* Stephen Powis [mailto:spowis@salesforce.com]
> *Sent:* Tuesday, December 01, 2015 10:27 AM
>
>
> *To:* user@storm.apache.org
> *Subject:* Re: Using Storm to parse emails and creates batches
>
>
>
> If you are using Storm's guaranteed message processing
> <http://storm.apache.org/documentation/Guaranteeing-message-processing.html>
> there is no need to 'persist' the collection anywhere other than in
> memory.  IE List<Tuple> myListOfTuples = new ArrayList<Tuple>();  If the
> third bolt crashes and loses its in memory collection, after
> Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS
> <http://storm.apache.org/javadoc/apidocs/backtype/storm/Config.html#TOPOLOGY_MESSAGE_TIMEOUT_SECS>
> the tuples will timeout and be replayed thru your entire storm topology and
> your collection will be repopulated
>
>
>
> On Tue, Dec 1, 2015 at 5:15 PM, Kalogeropoulos, Andreas <
> Andreas.Kalogeropoulos@emc.com> wrote:
>
> Hello Stephen,
>
>
>
> I think you got I correctly. Thanks a lot for the idea.
>
> If you have seen limitations, please send the disclaimers J . For
> example, how did you handle persistence of this collection ? If the third
> bolt failed while populating the collection (size and time has not been
> reached) we just lost everything, so I need to have a status loopback of
> what was really output. Right ?
>
>
>
> Of course, if you can send me the code of your third bolt (especially the
> collection handling), I’ll be grateful.
>
> In all cases, thanks a lot for your help, even without the code, you
> really give me example advice, and now I can start building something.
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
> *From:* Stephen Powis [mailto:spowis@salesforce.com]
> *Sent:* Monday, November 30, 2015 5:55 PM
> *To:* user@storm.apache.org
> *Subject:* Re: Using Storm to parse emails and creates batches
>
>
>
> From what I understand from your description, you want bolt 3 to collect
> results from multiple tuples and build a single xml for them.  We've done
> this by essentially doing the following:
>
>
>
> Bolt 3 has a collection of tuples.  As a tuple comes in, we add it to the
> collection and check the size of the collection.  Once the size of the
> collection exceeds some number, we then process all of the tuples in one
> go, and then ACK all of them after the processing completes.
>
> Building on that, we've implemented an additional constraint on time.  If
> the collection size > N OR if we've waited more than X seconds, process the
> batch.  This way your output won't stall out if your topology has a lull in
> data being ingested.
>
> And then lastly, there's a corner case where say 10 tuples come in and get
> held by our collection but then no other tuples come in for a long period
> of time.  If no tuples enter, that means the size and timeout checks are
> never executed and your bolt will hold onto those tuples for a long time
> (potentially causing timeouts).  To handle this, we made use of tick
> tuples.  Tick tuples essentially allow you to you to send a special tuple
> to your bolt every Y seconds.  We use that to trigger checking the time
> constraint is checked on a regular basis (example being send a tick tuple
> every 1, 5, or 10 seconds)
>
>
>
> On Tue, Dec 1, 2015 at 1:42 AM, Kalogeropoulos, Andreas <
> Andreas.Kalogeropoulos@emc.com> wrote:
>
> Hello,
>
>
>
> I want to use Storm to do three things :
>
> 1.       Parse emails data (from/ to / cc/ subject ) from incoming SMTP
> source
>
> 2.       Add additional information (based on sender email)
>
> 3.       Create an XML based on this data, to inject in another solution
>
>
>
> Only issue, I want step 1 (and 2) to be as fast as possible so creating
> the maximum bolts/tasks possible,
>
> But I want the XML to be as big as possible so gathering information for
> multiple output of bolts.
>
>
>
> In this logic, I fi have 100 mails per second in original input, I would
> want to have step1 and step 2 to work on the smallest number of emails to
> do it faster.
>
> But I still want to be able to have an XML that represent 10 000+ emails
> at the end.
>
>
>
> I can’t think of topology to address this.
>
> Can someone give me some pointers to the best way to handle this ?
>
>
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
>
>
>
>
>
>
>
>

RE: Using Storm to parse emails and creates batches

Posted by "Kalogeropoulos, Andreas" <An...@emc.com>.

Making sure that duplicates make it in the same XML file (third bolt).

Kind Regards,
Andréas Kalogéropoulos

From: Stephen Powis [mailto:spowis@salesforce.com]
Sent: Tuesday, December 01, 2015 11:59 AM
To: user@storm.apache.org
Subject: Re: Using Storm to parse emails and creates batches

So you want to eliminate duplicates or make sure that duplicates make it into the same XML file (third bolt)?

On Tue, Dec 1, 2015 at 7:48 PM, Kalogeropoulos, Andreas <An...@emc.com>> wrote:
Hello Stephen,

Imagine that the spout is providing me 300 000 emails per hour.
The first bolt will parse/analyze the information (from, to, cc, subject, object, date, has of attachments, …  , and probably will find the same hash for some attachments (someone forwarding an email).

The last bolt will create an XML based on all this information, but if I can have the tuples containing the same attachment (based on hash) in the same XML, I can actually apply a dedup logic : having multiple lines in my xml pointing to the same file

Does this make more sense ?

Kind Regards,
Andréas Kalogéropoulos

From: Stephen Powis [mailto:spowis@salesforce.com<ma...@salesforce.com>]
Sent: Tuesday, December 01, 2015 11:36 AM

To: user@storm.apache.org<ma...@storm.apache.org>
Subject: Re: Using Storm to parse emails and creates batches

I'm not sure I follow/understand your question or what you're trying to do.

On Tue, Dec 1, 2015 at 7:28 PM, Kalogeropoulos, Andreas <An...@emc.com>> wrote:
You are right. Sorry for making you state the obvious ☺.

Last question : If my spout has incoming information that I want to have in the same last bolt (the one creating the XML) for deduplication logic, what is the best way to achieve this ?  My instinct says to try to work with Fields grouping and the correct key (probably conversation since I am working with emails).

Kind Regards,
Andréas Kalogéropoulos

From: Stephen Powis [mailto:spowis@salesforce.com<ma...@salesforce.com>]
Sent: Tuesday, December 01, 2015 10:27 AM

To: user@storm.apache.org<ma...@storm.apache.org>
Subject: Re: Using Storm to parse emails and creates batches

If you are using Storm's guaranteed message processing<http://storm.apache.org/documentation/Guaranteeing-message-processing.html> there is no need to 'persist' the collection anywhere other than in memory.  IE List<Tuple> myListOfTuples = new ArrayList<Tuple>();  If the third bolt crashes and loses its in memory collection, after Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS<http://storm.apache.org/javadoc/apidocs/backtype/storm/Config.html#TOPOLOGY_MESSAGE_TIMEOUT_SECS> the tuples will timeout and be replayed thru your entire storm topology and your collection will be repopulated

On Tue, Dec 1, 2015 at 5:15 PM, Kalogeropoulos, Andreas <An...@emc.com>> wrote:
Hello Stephen,

I think you got I correctly. Thanks a lot for the idea.
If you have seen limitations, please send the disclaimers ☺ . For example, how did you handle persistence of this collection ? If the third bolt failed while populating the collection (size and time has not been reached) we just lost everything, so I need to have a status loopback of what was really output. Right ?

Of course, if you can send me the code of your third bolt (especially the collection handling), I’ll be grateful.
In all cases, thanks a lot for your help, even without the code, you really give me example advice, and now I can start building something.

Kind Regards,
Andréas Kalogéropoulos

From: Stephen Powis [mailto:spowis@salesforce.com<ma...@salesforce.com>]
Sent: Monday, November 30, 2015 5:55 PM
To: user@storm.apache.org<ma...@storm.apache.org>
Subject: Re: Using Storm to parse emails and creates batches

From what I understand from your description, you want bolt 3 to collect results from multiple tuples and build a single xml for them.  We've done this by essentially doing the following:

Bolt 3 has a collection of tuples.  As a tuple comes in, we add it to the collection and check the size of the collection.  Once the size of the collection exceeds some number, we then process all of the tuples in one go, and then ACK all of them after the processing completes.

Building on that, we've implemented an additional constraint on time.  If the collection size > N OR if we've waited more than X seconds, process the batch.  This way your output won't stall out if your topology has a lull in data being ingested.
And then lastly, there's a corner case where say 10 tuples come in and get held by our collection but then no other tuples come in for a long period of time.  If no tuples enter, that means the size and timeout checks are never executed and your bolt will hold onto those tuples for a long time (potentially causing timeouts).  To handle this, we made use of tick tuples.  Tick tuples essentially allow you to you to send a special tuple to your bolt every Y seconds.  We use that to trigger checking the time constraint is checked on a regular basis (example being send a tick tuple every 1, 5, or 10 seconds)

On Tue, Dec 1, 2015 at 1:42 AM, Kalogeropoulos, Andreas <An...@emc.com>> wrote:
Hello,

I want to use Storm to do three things :

1.       Parse emails data (from/ to / cc/ subject ) from incoming SMTP source

2.       Add additional information (based on sender email)

3.       Create an XML based on this data, to inject in another solution

Only issue, I want step 1 (and 2) to be as fast as possible so creating the maximum bolts/tasks possible,
But I want the XML to be as big as possible so gathering information for multiple output of bolts.

In this logic, I fi have 100 mails per second in original input, I would want to have step1 and step 2 to work on the smallest number of emails to do it faster.
But I still want to be able to have an XML that represent 10 000+ emails at the end.

I can’t think of topology to address this.
Can someone give me some pointers to the best way to handle this ?

Kind Regards,
Andréas Kalogéropoulos

Re: Using Storm to parse emails and creates batches

Posted by Stephen Powis <sp...@salesforce.com>.

So you want to eliminate duplicates or make sure that duplicates make it
into the same XML file (third bolt)?

On Tue, Dec 1, 2015 at 7:48 PM, Kalogeropoulos, Andreas <
Andreas.Kalogeropoulos@emc.com> wrote:

> Hello Stephen,
>
>
>
> Imagine that the spout is providing me 300 000 emails per hour.
>
> The first bolt will parse/analyze the information (from, to, cc, subject,
> object, date, has of attachments, …  , and probably will find the same hash
> for some attachments (someone forwarding an email).
>
>
>
> The last bolt will create an XML based on all this information, but if I
> can have the tuples containing the same attachment (based on hash) in the
> same XML, I can actually apply a dedup logic : having multiple lines in my
> xml pointing to the same file
>
>
>
> Does this make more sense ?
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
> *From:* Stephen Powis [mailto:spowis@salesforce.com]
> *Sent:* Tuesday, December 01, 2015 11:36 AM
>
> *To:* user@storm.apache.org
> *Subject:* Re: Using Storm to parse emails and creates batches
>
>
>
> I'm not sure I follow/understand your question or what you're trying to do.
>
>
>
> On Tue, Dec 1, 2015 at 7:28 PM, Kalogeropoulos, Andreas <
> Andreas.Kalogeropoulos@emc.com> wrote:
>
> You are right. Sorry for making you state the obvious J.
>
>
>
> Last question : If my spout has incoming information that I want to have
> in the same last bolt (the one creating the XML) for deduplication logic,
> what is the best way to achieve this ?  My instinct says to try to work
> with Fields grouping and the correct key (probably conversation since I am
> working with emails).
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
> *From:* Stephen Powis [mailto:spowis@salesforce.com]
> *Sent:* Tuesday, December 01, 2015 10:27 AM
>
>
> *To:* user@storm.apache.org
> *Subject:* Re: Using Storm to parse emails and creates batches
>
>
>
> If you are using Storm's guaranteed message processing
> <http://storm.apache.org/documentation/Guaranteeing-message-processing.html>
> there is no need to 'persist' the collection anywhere other than in
> memory.  IE List<Tuple> myListOfTuples = new ArrayList<Tuple>();  If the
> third bolt crashes and loses its in memory collection, after
> Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS
> <http://storm.apache.org/javadoc/apidocs/backtype/storm/Config.html#TOPOLOGY_MESSAGE_TIMEOUT_SECS>
> the tuples will timeout and be replayed thru your entire storm topology and
> your collection will be repopulated
>
>
>
> On Tue, Dec 1, 2015 at 5:15 PM, Kalogeropoulos, Andreas <
> Andreas.Kalogeropoulos@emc.com> wrote:
>
> Hello Stephen,
>
>
>
> I think you got I correctly. Thanks a lot for the idea.
>
> If you have seen limitations, please send the disclaimers J . For
> example, how did you handle persistence of this collection ? If the third
> bolt failed while populating the collection (size and time has not been
> reached) we just lost everything, so I need to have a status loopback of
> what was really output. Right ?
>
>
>
> Of course, if you can send me the code of your third bolt (especially the
> collection handling), I’ll be grateful.
>
> In all cases, thanks a lot for your help, even without the code, you
> really give me example advice, and now I can start building something.
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
> *From:* Stephen Powis [mailto:spowis@salesforce.com]
> *Sent:* Monday, November 30, 2015 5:55 PM
> *To:* user@storm.apache.org
> *Subject:* Re: Using Storm to parse emails and creates batches
>
>
>
> From what I understand from your description, you want bolt 3 to collect
> results from multiple tuples and build a single xml for them.  We've done
> this by essentially doing the following:
>
>
>
> Bolt 3 has a collection of tuples.  As a tuple comes in, we add it to the
> collection and check the size of the collection.  Once the size of the
> collection exceeds some number, we then process all of the tuples in one
> go, and then ACK all of them after the processing completes.
>
> Building on that, we've implemented an additional constraint on time.  If
> the collection size > N OR if we've waited more than X seconds, process the
> batch.  This way your output won't stall out if your topology has a lull in
> data being ingested.
>
> And then lastly, there's a corner case where say 10 tuples come in and get
> held by our collection but then no other tuples come in for a long period
> of time.  If no tuples enter, that means the size and timeout checks are
> never executed and your bolt will hold onto those tuples for a long time
> (potentially causing timeouts).  To handle this, we made use of tick
> tuples.  Tick tuples essentially allow you to you to send a special tuple
> to your bolt every Y seconds.  We use that to trigger checking the time
> constraint is checked on a regular basis (example being send a tick tuple
> every 1, 5, or 10 seconds)
>
>
>
> On Tue, Dec 1, 2015 at 1:42 AM, Kalogeropoulos, Andreas <
> Andreas.Kalogeropoulos@emc.com> wrote:
>
> Hello,
>
>
>
> I want to use Storm to do three things :
>
> 1.       Parse emails data (from/ to / cc/ subject ) from incoming SMTP
> source
>
> 2.       Add additional information (based on sender email)
>
> 3.       Create an XML based on this data, to inject in another solution
>
>
>
> Only issue, I want step 1 (and 2) to be as fast as possible so creating
> the maximum bolts/tasks possible,
>
> But I want the XML to be as big as possible so gathering information for
> multiple output of bolts.
>
>
>
> In this logic, I fi have 100 mails per second in original input, I would
> want to have step1 and step 2 to work on the smallest number of emails to
> do it faster.
>
> But I still want to be able to have an XML that represent 10 000+ emails
> at the end.
>
>
>
> I can’t think of topology to address this.
>
> Can someone give me some pointers to the best way to handle this ?
>
>
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
>
>
>
>
>
>

RE: Using Storm to parse emails and creates batches

Posted by "Kalogeropoulos, Andreas" <An...@emc.com>.

Hello Stephen,

Imagine that the spout is providing me 300 000 emails per hour.
The first bolt will parse/analyze the information (from, to, cc, subject, object, date, has of attachments, …  , and probably will find the same hash for some attachments (someone forwarding an email).

The last bolt will create an XML based on all this information, but if I can have the tuples containing the same attachment (based on hash) in the same XML, I can actually apply a dedup logic : having multiple lines in my xml pointing to the same file

Does this make more sense ?

Kind Regards,
Andréas Kalogéropoulos

From: Stephen Powis [mailto:spowis@salesforce.com]
Sent: Tuesday, December 01, 2015 11:36 AM
To: user@storm.apache.org
Subject: Re: Using Storm to parse emails and creates batches

I'm not sure I follow/understand your question or what you're trying to do.

On Tue, Dec 1, 2015 at 7:28 PM, Kalogeropoulos, Andreas <An...@emc.com>> wrote:
You are right. Sorry for making you state the obvious ☺.

Last question : If my spout has incoming information that I want to have in the same last bolt (the one creating the XML) for deduplication logic, what is the best way to achieve this ?  My instinct says to try to work with Fields grouping and the correct key (probably conversation since I am working with emails).

Kind Regards,
Andréas Kalogéropoulos

From: Stephen Powis [mailto:spowis@salesforce.com<ma...@salesforce.com>]
Sent: Tuesday, December 01, 2015 10:27 AM

To: user@storm.apache.org<ma...@storm.apache.org>
Subject: Re: Using Storm to parse emails and creates batches

If you are using Storm's guaranteed message processing<http://storm.apache.org/documentation/Guaranteeing-message-processing.html> there is no need to 'persist' the collection anywhere other than in memory.  IE List<Tuple> myListOfTuples = new ArrayList<Tuple>();  If the third bolt crashes and loses its in memory collection, after Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS<http://storm.apache.org/javadoc/apidocs/backtype/storm/Config.html#TOPOLOGY_MESSAGE_TIMEOUT_SECS> the tuples will timeout and be replayed thru your entire storm topology and your collection will be repopulated

On Tue, Dec 1, 2015 at 5:15 PM, Kalogeropoulos, Andreas <An...@emc.com>> wrote:
Hello Stephen,

I think you got I correctly. Thanks a lot for the idea.
If you have seen limitations, please send the disclaimers ☺ . For example, how did you handle persistence of this collection ? If the third bolt failed while populating the collection (size and time has not been reached) we just lost everything, so I need to have a status loopback of what was really output. Right ?

Of course, if you can send me the code of your third bolt (especially the collection handling), I’ll be grateful.
In all cases, thanks a lot for your help, even without the code, you really give me example advice, and now I can start building something.

Kind Regards,
Andréas Kalogéropoulos

From: Stephen Powis [mailto:spowis@salesforce.com<ma...@salesforce.com>]
Sent: Monday, November 30, 2015 5:55 PM
To: user@storm.apache.org<ma...@storm.apache.org>
Subject: Re: Using Storm to parse emails and creates batches

From what I understand from your description, you want bolt 3 to collect results from multiple tuples and build a single xml for them.  We've done this by essentially doing the following:

Bolt 3 has a collection of tuples.  As a tuple comes in, we add it to the collection and check the size of the collection.  Once the size of the collection exceeds some number, we then process all of the tuples in one go, and then ACK all of them after the processing completes.

Building on that, we've implemented an additional constraint on time.  If the collection size > N OR if we've waited more than X seconds, process the batch.  This way your output won't stall out if your topology has a lull in data being ingested.
And then lastly, there's a corner case where say 10 tuples come in and get held by our collection but then no other tuples come in for a long period of time.  If no tuples enter, that means the size and timeout checks are never executed and your bolt will hold onto those tuples for a long time (potentially causing timeouts).  To handle this, we made use of tick tuples.  Tick tuples essentially allow you to you to send a special tuple to your bolt every Y seconds.  We use that to trigger checking the time constraint is checked on a regular basis (example being send a tick tuple every 1, 5, or 10 seconds)

On Tue, Dec 1, 2015 at 1:42 AM, Kalogeropoulos, Andreas <An...@emc.com>> wrote:
Hello,

I want to use Storm to do three things :

1.       Parse emails data (from/ to / cc/ subject ) from incoming SMTP source

2.       Add additional information (based on sender email)

3.       Create an XML based on this data, to inject in another solution

Only issue, I want step 1 (and 2) to be as fast as possible so creating the maximum bolts/tasks possible,
But I want the XML to be as big as possible so gathering information for multiple output of bolts.

In this logic, I fi have 100 mails per second in original input, I would want to have step1 and step 2 to work on the smallest number of emails to do it faster.
But I still want to be able to have an XML that represent 10 000+ emails at the end.

I can’t think of topology to address this.
Can someone give me some pointers to the best way to handle this ?


Kind Regards,
Andréas Kalogéropoulos

Re: Using Storm to parse emails and creates batches

Posted by Stephen Powis <sp...@salesforce.com>.

I'm not sure I follow/understand your question or what you're trying to do.

On Tue, Dec 1, 2015 at 7:28 PM, Kalogeropoulos, Andreas <
Andreas.Kalogeropoulos@emc.com> wrote:

> You are right. Sorry for making you state the obvious J.
>
>
>
> Last question : If my spout has incoming information that I want to have
> in the same last bolt (the one creating the XML) for deduplication logic,
> what is the best way to achieve this ?  My instinct says to try to work
> with Fields grouping and the correct key (probably conversation since I am
> working with emails).
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
> *From:* Stephen Powis [mailto:spowis@salesforce.com]
> *Sent:* Tuesday, December 01, 2015 10:27 AM
>
> *To:* user@storm.apache.org
> *Subject:* Re: Using Storm to parse emails and creates batches
>
>
>
> If you are using Storm's guaranteed message processing
> <http://storm.apache.org/documentation/Guaranteeing-message-processing.html>
> there is no need to 'persist' the collection anywhere other than in
> memory.  IE List<Tuple> myListOfTuples = new ArrayList<Tuple>();  If the
> third bolt crashes and loses its in memory collection, after
> Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS
> <http://storm.apache.org/javadoc/apidocs/backtype/storm/Config.html#TOPOLOGY_MESSAGE_TIMEOUT_SECS>
> the tuples will timeout and be replayed thru your entire storm topology and
> your collection will be repopulated
>
>
>
> On Tue, Dec 1, 2015 at 5:15 PM, Kalogeropoulos, Andreas <
> Andreas.Kalogeropoulos@emc.com> wrote:
>
> Hello Stephen,
>
>
>
> I think you got I correctly. Thanks a lot for the idea.
>
> If you have seen limitations, please send the disclaimers J . For
> example, how did you handle persistence of this collection ? If the third
> bolt failed while populating the collection (size and time has not been
> reached) we just lost everything, so I need to have a status loopback of
> what was really output. Right ?
>
>
>
> Of course, if you can send me the code of your third bolt (especially the
> collection handling), I’ll be grateful.
>
> In all cases, thanks a lot for your help, even without the code, you
> really give me example advice, and now I can start building something.
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
> *From:* Stephen Powis [mailto:spowis@salesforce.com]
> *Sent:* Monday, November 30, 2015 5:55 PM
> *To:* user@storm.apache.org
> *Subject:* Re: Using Storm to parse emails and creates batches
>
>
>
> From what I understand from your description, you want bolt 3 to collect
> results from multiple tuples and build a single xml for them.  We've done
> this by essentially doing the following:
>
>
>
> Bolt 3 has a collection of tuples.  As a tuple comes in, we add it to the
> collection and check the size of the collection.  Once the size of the
> collection exceeds some number, we then process all of the tuples in one
> go, and then ACK all of them after the processing completes.
>
> Building on that, we've implemented an additional constraint on time.  If
> the collection size > N OR if we've waited more than X seconds, process the
> batch.  This way your output won't stall out if your topology has a lull in
> data being ingested.
>
> And then lastly, there's a corner case where say 10 tuples come in and get
> held by our collection but then no other tuples come in for a long period
> of time.  If no tuples enter, that means the size and timeout checks are
> never executed and your bolt will hold onto those tuples for a long time
> (potentially causing timeouts).  To handle this, we made use of tick
> tuples.  Tick tuples essentially allow you to you to send a special tuple
> to your bolt every Y seconds.  We use that to trigger checking the time
> constraint is checked on a regular basis (example being send a tick tuple
> every 1, 5, or 10 seconds)
>
>
>
> On Tue, Dec 1, 2015 at 1:42 AM, Kalogeropoulos, Andreas <
> Andreas.Kalogeropoulos@emc.com> wrote:
>
> Hello,
>
>
>
> I want to use Storm to do three things :
>
> 1.       Parse emails data (from/ to / cc/ subject ) from incoming SMTP
> source
>
> 2.       Add additional information (based on sender email)
>
> 3.       Create an XML based on this data, to inject in another solution
>
>
>
> Only issue, I want step 1 (and 2) to be as fast as possible so creating
> the maximum bolts/tasks possible,
>
> But I want the XML to be as big as possible so gathering information for
> multiple output of bolts.
>
>
>
> In this logic, I fi have 100 mails per second in original input, I would
> want to have step1 and step 2 to work on the smallest number of emails to
> do it faster.
>
> But I still want to be able to have an XML that represent 10 000+ emails
> at the end.
>
>
>
> I can’t think of topology to address this.
>
> Can someone give me some pointers to the best way to handle this ?
>
>
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
>
>
>
>

RE: Using Storm to parse emails and creates batches

Posted by "Kalogeropoulos, Andreas" <An...@emc.com>.

You are right. Sorry for making you state the obvious ☺.

Last question : If my spout has incoming information that I want to have in the same last bolt (the one creating the XML) for deduplication logic, what is the best way to achieve this ?  My instinct says to try to work with Fields grouping and the correct key (probably conversation since I am working with emails).

Kind Regards,
Andréas Kalogéropoulos

From: Stephen Powis [mailto:spowis@salesforce.com]
Sent: Tuesday, December 01, 2015 10:27 AM
To: user@storm.apache.org
Subject: Re: Using Storm to parse emails and creates batches

If you are using Storm's guaranteed message processing<http://storm.apache.org/documentation/Guaranteeing-message-processing.html> there is no need to 'persist' the collection anywhere other than in memory.  IE List<Tuple> myListOfTuples = new ArrayList<Tuple>();  If the third bolt crashes and loses its in memory collection, after Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS<http://storm.apache.org/javadoc/apidocs/backtype/storm/Config.html#TOPOLOGY_MESSAGE_TIMEOUT_SECS> the tuples will timeout and be replayed thru your entire storm topology and your collection will be repopulated

On Tue, Dec 1, 2015 at 5:15 PM, Kalogeropoulos, Andreas <An...@emc.com>> wrote:
Hello Stephen,

I think you got I correctly. Thanks a lot for the idea.
If you have seen limitations, please send the disclaimers ☺ . For example, how did you handle persistence of this collection ? If the third bolt failed while populating the collection (size and time has not been reached) we just lost everything, so I need to have a status loopback of what was really output. Right ?

Of course, if you can send me the code of your third bolt (especially the collection handling), I’ll be grateful.
In all cases, thanks a lot for your help, even without the code, you really give me example advice, and now I can start building something.

Kind Regards,
Andréas Kalogéropoulos

From: Stephen Powis [mailto:spowis@salesforce.com<ma...@salesforce.com>]
Sent: Monday, November 30, 2015 5:55 PM
To: user@storm.apache.org<ma...@storm.apache.org>
Subject: Re: Using Storm to parse emails and creates batches

From what I understand from your description, you want bolt 3 to collect results from multiple tuples and build a single xml for them.  We've done this by essentially doing the following:

Bolt 3 has a collection of tuples.  As a tuple comes in, we add it to the collection and check the size of the collection.  Once the size of the collection exceeds some number, we then process all of the tuples in one go, and then ACK all of them after the processing completes.

Building on that, we've implemented an additional constraint on time.  If the collection size > N OR if we've waited more than X seconds, process the batch.  This way your output won't stall out if your topology has a lull in data being ingested.
And then lastly, there's a corner case where say 10 tuples come in and get held by our collection but then no other tuples come in for a long period of time.  If no tuples enter, that means the size and timeout checks are never executed and your bolt will hold onto those tuples for a long time (potentially causing timeouts).  To handle this, we made use of tick tuples.  Tick tuples essentially allow you to you to send a special tuple to your bolt every Y seconds.  We use that to trigger checking the time constraint is checked on a regular basis (example being send a tick tuple every 1, 5, or 10 seconds)

On Tue, Dec 1, 2015 at 1:42 AM, Kalogeropoulos, Andreas <An...@emc.com>> wrote:
Hello,

I want to use Storm to do three things :

1.       Parse emails data (from/ to / cc/ subject ) from incoming SMTP source

2.       Add additional information (based on sender email)

3.       Create an XML based on this data, to inject in another solution

Only issue, I want step 1 (and 2) to be as fast as possible so creating the maximum bolts/tasks possible,
But I want the XML to be as big as possible so gathering information for multiple output of bolts.

In this logic, I fi have 100 mails per second in original input, I would want to have step1 and step 2 to work on the smallest number of emails to do it faster.
But I still want to be able to have an XML that represent 10 000+ emails at the end.

I can’t think of topology to address this.
Can someone give me some pointers to the best way to handle this ?

Kind Regards,
Andréas Kalogéropoulos

Re: Using Storm to parse emails and creates batches

Posted by Stephen Powis <sp...@salesforce.com>.

If you are using Storm's guaranteed message processing
<http://storm.apache.org/documentation/Guaranteeing-message-processing.html>
there is no need to 'persist' the collection anywhere other than in
memory.  IE List<Tuple> myListOfTuples = new ArrayList<Tuple>();  If the
third bolt crashes and loses its in memory collection, after
Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS
<http://storm.apache.org/javadoc/apidocs/backtype/storm/Config.html#TOPOLOGY_MESSAGE_TIMEOUT_SECS>
the tuples will timeout and be replayed thru your entire storm topology and
your collection will be repopulated

On Tue, Dec 1, 2015 at 5:15 PM, Kalogeropoulos, Andreas <
Andreas.Kalogeropoulos@emc.com> wrote:

> Hello Stephen,
>
>
>
> I think you got I correctly. Thanks a lot for the idea.
>
> If you have seen limitations, please send the disclaimers J . For
> example, how did you handle persistence of this collection ? If the third
> bolt failed while populating the collection (size and time has not been
> reached) we just lost everything, so I need to have a status loopback of
> what was really output. Right ?
>
>
>
> Of course, if you can send me the code of your third bolt (especially the
> collection handling), I’ll be grateful.
>
> In all cases, thanks a lot for your help, even without the code, you
> really give me example advice, and now I can start building something.
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
> *From:* Stephen Powis [mailto:spowis@salesforce.com]
> *Sent:* Monday, November 30, 2015 5:55 PM
> *To:* user@storm.apache.org
> *Subject:* Re: Using Storm to parse emails and creates batches
>
>
>
> From what I understand from your description, you want bolt 3 to collect
> results from multiple tuples and build a single xml for them.  We've done
> this by essentially doing the following:
>
>
>
> Bolt 3 has a collection of tuples.  As a tuple comes in, we add it to the
> collection and check the size of the collection.  Once the size of the
> collection exceeds some number, we then process all of the tuples in one
> go, and then ACK all of them after the processing completes.
>
> Building on that, we've implemented an additional constraint on time.  If
> the collection size > N OR if we've waited more than X seconds, process the
> batch.  This way your output won't stall out if your topology has a lull in
> data being ingested.
>
> And then lastly, there's a corner case where say 10 tuples come in and get
> held by our collection but then no other tuples come in for a long period
> of time.  If no tuples enter, that means the size and timeout checks are
> never executed and your bolt will hold onto those tuples for a long time
> (potentially causing timeouts).  To handle this, we made use of tick
> tuples.  Tick tuples essentially allow you to you to send a special tuple
> to your bolt every Y seconds.  We use that to trigger checking the time
> constraint is checked on a regular basis (example being send a tick tuple
> every 1, 5, or 10 seconds)
>
>
>
> On Tue, Dec 1, 2015 at 1:42 AM, Kalogeropoulos, Andreas <
> Andreas.Kalogeropoulos@emc.com> wrote:
>
> Hello,
>
>
>
> I want to use Storm to do three things :
>
> 1.       Parse emails data (from/ to / cc/ subject ) from incoming SMTP
> source
>
> 2.       Add additional information (based on sender email)
>
> 3.       Create an XML based on this data, to inject in another solution
>
>
>
> Only issue, I want step 1 (and 2) to be as fast as possible so creating
> the maximum bolts/tasks possible,
>
> But I want the XML to be as big as possible so gathering information for
> multiple output of bolts.
>
>
>
> In this logic, I fi have 100 mails per second in original input, I would
> want to have step1 and step 2 to work on the smallest number of emails to
> do it faster.
>
> But I still want to be able to have an XML that represent 10 000+ emails
> at the end.
>
>
>
> I can’t think of topology to address this.
>
> Can someone give me some pointers to the best way to handle this ?
>
>
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
>
>

RE: Using Storm to parse emails and creates batches

Posted by "Kalogeropoulos, Andreas" <An...@emc.com>.

Hello Stephen,

I think you got I correctly. Thanks a lot for the idea.
If you have seen limitations, please send the disclaimers ☺ . For example, how did you handle persistence of this collection ? If the third bolt failed while populating the collection (size and time has not been reached) we just lost everything, so I need to have a status loopback of what was really output. Right ?

Of course, if you can send me the code of your third bolt (especially the collection handling), I’ll be grateful.
In all cases, thanks a lot for your help, even without the code, you really give me example advice, and now I can start building something.

Kind Regards,
Andréas Kalogéropoulos

From: Stephen Powis [mailto:spowis@salesforce.com]
Sent: Monday, November 30, 2015 5:55 PM
To: user@storm.apache.org
Subject: Re: Using Storm to parse emails and creates batches

From what I understand from your description, you want bolt 3 to collect results from multiple tuples and build a single xml for them.  We've done this by essentially doing the following:

Bolt 3 has a collection of tuples.  As a tuple comes in, we add it to the collection and check the size of the collection.  Once the size of the collection exceeds some number, we then process all of the tuples in one go, and then ACK all of them after the processing completes.

Building on that, we've implemented an additional constraint on time.  If the collection size > N OR if we've waited more than X seconds, process the batch.  This way your output won't stall out if your topology has a lull in data being ingested.
And then lastly, there's a corner case where say 10 tuples come in and get held by our collection but then no other tuples come in for a long period of time.  If no tuples enter, that means the size and timeout checks are never executed and your bolt will hold onto those tuples for a long time (potentially causing timeouts).  To handle this, we made use of tick tuples.  Tick tuples essentially allow you to you to send a special tuple to your bolt every Y seconds.  We use that to trigger checking the time constraint is checked on a regular basis (example being send a tick tuple every 1, 5, or 10 seconds)

On Tue, Dec 1, 2015 at 1:42 AM, Kalogeropoulos, Andreas <An...@emc.com>> wrote:
Hello,

I want to use Storm to do three things :

1.       Parse emails data (from/ to / cc/ subject ) from incoming SMTP source

2.       Add additional information (based on sender email)

3.       Create an XML based on this data, to inject in another solution

Only issue, I want step 1 (and 2) to be as fast as possible so creating the maximum bolts/tasks possible,
But I want the XML to be as big as possible so gathering information for multiple output of bolts.

In this logic, I fi have 100 mails per second in original input, I would want to have step1 and step 2 to work on the smallest number of emails to do it faster.
But I still want to be able to have an XML that represent 10 000+ emails at the end.

I can’t think of topology to address this.
Can someone give me some pointers to the best way to handle this ?


Kind Regards,
Andréas Kalogéropoulos

Re: Using Storm to parse emails and creates batches

Posted by Stephen Powis <sp...@salesforce.com>.

>From what I understand from your description, you want bolt 3 to collect
results from multiple tuples and build a single xml for them.  We've done
this by essentially doing the following:

Bolt 3 has a collection of tuples.  As a tuple comes in, we add it to the
collection and check the size of the collection.  Once the size of the
collection exceeds some number, we then process all of the tuples in one
go, and then ACK all of them after the processing completes.

Building on that, we've implemented an additional constraint on time.  If
the collection size > N OR if we've waited more than X seconds, process the
batch.  This way your output won't stall out if your topology has a lull in
data being ingested.

And then lastly, there's a corner case where say 10 tuples come in and get
held by our collection but then no other tuples come in for a long period
of time.  If no tuples enter, that means the size and timeout checks are
never executed and your bolt will hold onto those tuples for a long time
(potentially causing timeouts).  To handle this, we made use of tick
tuples.  Tick tuples essentially allow you to you to send a special tuple
to your bolt every Y seconds.  We use that to trigger checking the time
constraint is checked on a regular basis (example being send a tick tuple
every 1, 5, or 10 seconds)

On Tue, Dec 1, 2015 at 1:42 AM, Kalogeropoulos, Andreas <
Andreas.Kalogeropoulos@emc.com> wrote:

> Hello,
>
>
>
> I want to use Storm to do three things :
>
> 1.       Parse emails data (from/ to / cc/ subject ) from incoming SMTP
> source
>
> 2.       Add additional information (based on sender email)
>
> 3.       Create an XML based on this data, to inject in another solution
>
>
>
> Only issue, I want step 1 (and 2) to be as fast as possible so creating
> the maximum bolts/tasks possible,
>
> But I want the XML to be as big as possible so gathering information for
> multiple output of bolts.
>
>
>
> In this logic, I fi have 100 mails per second in original input, I would
> want to have step1 and step 2 to work on the smallest number of emails to
> do it faster.
>
> But I still want to be able to have an XML that represent 10 000+ emails
> at the end.
>
>
>
> I can’t think of topology to address this.
>
> Can someone give me some pointers to the best way to handle this ?
>
>
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>