Posted to user@storm.apache.org by Sean Solbak <se...@solbak.ca> on 2014/03/02 20:23:32 UTC

Tuning and nimbus at 99%

I'm writing a fairly basic Trident topology, as follows (rough code sketch below):

- 4 spouts of events
- merges into one stream
- serializes the object as an event in a string
- saves to db
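
For reference, the wiring looks roughly like the sketch below.  EventSpout,
SerializeFn, DbStateFactory and DbUpdater are just placeholders for my own
classes (and the field names are made up), so treat it as a sketch rather
than the real code:

import backtype.storm.generated.StormTopology;
import backtype.storm.tuple.Fields;
import storm.trident.Stream;
import storm.trident.TridentTopology;

public class EventTopology {
    public static StormTopology build() {
        TridentTopology topology = new TridentTopology();

        // Four event spouts merged into a single stream.
        Stream merged = topology.merge(
                topology.newStream("events-1", new EventSpout()),
                topology.newStream("events-2", new EventSpout()),
                topology.newStream("events-3", new EventSpout()),
                topology.newStream("events-4", new EventSpout()));

        // The CPU-heavy serialization step, pulled out of the spout and
        // given its own parallelism; the serialized string is then written
        // to the db by the state updater.
        merged
            .each(new Fields("event"), new SerializeFn(), new Fields("json"))
            .parallelismHint(4)
            .partitionPersist(new DbStateFactory(), new Fields("json"),
                              new DbUpdater());

        return topology.build();
    }
}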

To speed things up, I split the serialization task away from the spout,
since it is CPU intensive.

The problem I have is that after 10 minutes there are over 910k tuples
emitted/transferred but only 193k records saved.

The overall load of the topology seems fine.

- 536.404 ms complete latency at the topology level
- The highest capacity of any bolt is 0.3, and that bolt is the serialization one.
- Each bolt task has sub-20 ms execute latency and sub-40 ms process
latency.

So it seems Trident has all the records internally, but I need these events
as close to real time as possible.

Does anyone have any guidance on how to increase the throughput?  Is it
simply a matter of tweaking max spout pending and the batch size (roughly
the knobs in the sketch below)?
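
For context, these are the two knobs I mean (the values below are examples
only, not what I'm actually running; as I understand it the batch-emit
interval key is what controls how often Trident cuts a new batch in 0.9).
This would slot into the main() that submits the topology sketched above:

import backtype.storm.Config;
import backtype.storm.StormSubmitter;

Config conf = new Config();
conf.setNumWorkers(2);

// Cap how many batches can be in flight (unacked) at once.
conf.setMaxSpoutPending(10);                                  // example value

// How often Trident emits a new batch, in milliseconds.
conf.put("topology.trident.batch.emit.interval.millis", 500); // example value

StormSubmitter.submitTopology("events", conf, EventTopology.build());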

I'm running it on 2 m1.smalls for now.  I don't see the need to upgrade
until the demand on the boxes is higher.  However, CPU usage on the nimbus
box is pinned at about 99%.  Why would that be?  It stays at 99% even when
all the topologies are killed.

We are currently targeting 200 million records per day, which seems like it
should be quite achievable based on what I've read other people have done.
I realize that better hardware would boost this as well, but my first goal
is to get Trident to push the records to the db more quickly.
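
(For scale, 200 million records per day averages out to only about 2,300
records per second, which is part of why I expect the target to be
achievable.)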

Thanks in advance,
Sean

Re: Tuning and nimbus at 99%

Posted by Sean Solbak <se...@solbak.ca>.

Sent from my iPhone

> On Mar 2, 2014, at 5:46 PM, Sean Allen <se...@monkeysnatchbanana.com> wrote:
> 
> Is there a reason you are using trident? 
> 
> If you don't need to handle the events as a batch, you are probably going to get better performance w/o it.
> 
> 
>> On Sun, Mar 2, 2014 at 2:23 PM, Sean Solbak <se...@solbak.ca> wrote:
>> Im writing a fairly basic trident topology as follows:
>> 
>> - 4 spouts of events
>> - merges into one stream
>> - serializes the object as an event in a string
>> - saves to db
>> 
>> I split the serialization task away from the spout as it was cpu intensive to speed it up.
>> 
>> The problem I have is that after 10 minutes there is over 910k tuples emitted/transfered but only 193k records are saved.
>> 
>> The overall load of the topology seems fine.
>>  
>> - 536.404 ms complete latency at the topolgy level
>> - The highest capacity of any bolt is 0.3 which is the serialization one.
>> - each bolt task has sub 20 ms execute latency and sub 40 ms process latency.
>> 
>> So it seems trident has all the records internally, but I need these events as close to realtime as possible.
>> 
>> Does anyone have any guidance as to how to increase the throughput?  Is it simply a matter of tweeking max spout pending and the batch size?
>> 
>> Im running it on 2 m1-smalls for now.  I dont see the need to upgrade it until the demand on the boxes seems higher.  Although CPU usage on the nimbus box is pinned.  Its at like 99%.  Why would that be?  Its at 99% even when all the topologies are killed.
>> 
>> We are currently targeting processing 200 million records per day which seems like it should be quite easy based on what Ive read that other people have achieved.  I realize that hardware should be able to boost this as well but my first goal is to get trident to push the records to the db quicker.
>> 
>> Thanks in advance,
>> Sean
> 
> 
> 
> -- 
> 
> Ce n'est pas une signature

Re: Tuning and nimbus at 99%

Posted by Sean Solbak <se...@solbak.ca>.
The hard drive was at 18%.

If it's not disk-space related, it must be some kind of memory overflow?
Hard to say, as nothing was running yet.

After killing the nimbus process and restarting, it's calmed down.  I'll
follow up in the morning, or sooner if it happens again.  I'm starting to
wonder if I should move away from m1.smalls, as I can't have these random
spikes in prod.

Thanks a bunch Otis and Michael!

S






On Mon, Mar 3, 2014 at 8:18 PM, Otis Gospodnetic <otis.gospodnetic@gmail.com
> wrote:

> Another possibility: sudo grep -i kill /var/log/messages*
> See
> http://www.oracle.com/technetwork/articles/servers-storage-dev/oom-killer-1911807.html
>
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Mon, Mar 3, 2014 at 8:54 PM, Michael Rose <mi...@fullcontact.com>wrote:
>
>> Otis,
>>
>> I'm a fan of SPM for Storm, but there's other debugging that needs to be
>> done here if the process quits constantly.
>>
>> Sean,
>>
>> Since you're using storm-deploy, I assume the processes are running under
>> supervisor. It might be worth killing the supervisor by hand, then running
>> it yourself (ssh as storm, cd storm/daemon, supervise .) and seeing what
>> kind of errors you see.
>>
>> Are your disks perhaps filled?
>>
>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>> michael@fullcontact.com
>>
>>
>> On Mon, Mar 3, 2014 at 6:49 PM, Otis Gospodnetic <
>> otis.gospodnetic@gmail.com> wrote:
>>
>>> Hi Sean,
>>>
>>> I don't think you can see the metrics you need to see with AWS
>>> CloudWatch.  Have a look at SPM for Storm.  You can share graphs from SPM
>>> directly if you want, so you don't have to grab and attach screenshots
>>> manually. See:
>>>
>>> http://blog.sematext.com/2014/01/30/announcement-apache-storm-monitoring-in-spm/+
>>> http://sematext.com/spm/
>>>
>>> My bet is that you'll see GC metrics spikes....
>>>
>>> Otis
>>> --
>>> Performance Monitoring * Log Analytics * Search Analytics
>>> Solr & Elasticsearch Support * http://sematext.com/
>>>
>>>
>>> On Mon, Mar 3, 2014 at 8:21 PM, Sean Solbak <se...@solbak.ca> wrote:
>>>
>>>> I just created a brand new cluster with storm-deploy command.
>>>>
>>>> lein deploy-storm --start --name storm-dev --commit
>>>> 1bcc169f5096e03a4ae117efc65c0f9bcfa2fa22
>>>>
>>>>  I had a meeting, did nothing to the box, no topologies were run.  I
>>>> came back 2 hours later and nimbus was at 100% cpu.
>>>>
>>>> I'm running on an m1-small on the following ami - ami-58a3cf68. Im
>>>> unable to get a threaddump as the process is getting killed and restarted
>>>> too fast.  I did attach a 3 hour snapshot of the ec2 monitors.  Any
>>>> guidance would be much appreciated.
>>>>
>>>> Thanks,
>>>> S
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sun, Mar 2, 2014 at 9:11 PM, Sean Solbak <se...@solbak.ca> wrote:
>>>>
>>>>> The only error in the logs, which happened over 10 days ago, was:
>>>>>
>>>>> 2014-02-22 01:41:27 b.s.d.nimbus [ERROR] Error when processing event
>>>>> java.io.IOException: Unable to delete directory
>>>>> /mnt/storm/nimbus/stormdist/test-25-1393022928.
>>>>>         at
>>>>> org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:981)
>>>>> ~[commons-io-1.4.jar:1.4]
>>>>>         at
>>>>> org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:1381)
>>>>> ~[commons-io-1.4.jar:1.4]
>>>>>         at backtype.storm.util$rmr.invoke(util.clj:442)
>>>>> ~[storm-core-0.9.0.1.jar:na]
>>>>>         at
>>>>> backtype.storm.daemon.nimbus$do_cleanup.invoke(nimbus.clj:819)
>>>>> ~[storm-core-0.9.0.1.jar:na]
>>>>>         at
>>>>> backtype.storm.daemon.nimbus$fn__5528$exec_fn__1229__auto____5529$fn__5534.invoke(nimbus.clj:896)
>>>>> ~[storm-core-0.9.0.1.jar:na]
>>>>>         at
>>>>> backtype.storm.timer$schedule_recurring$this__3019.invoke(timer.clj:77)
>>>>> ~[storm-core-0.9.0.1.jar:na]
>>>>>         at
>>>>> backtype.storm.timer$mk_timer$fn__3002$fn__3003.invoke(timer.clj:33)
>>>>> ~[storm-core-0.9.0.1.jar:na]
>>>>>         at backtype.storm.timer$mk_timer$fn__3002.invoke(timer.clj:26)
>>>>> ~[storm-core-0.9.0.1.jar:na]
>>>>>         at clojure.lang.AFn.run(AFn.java:24) ~[clojure-1.4.0.jar:na]
>>>>>         at java.lang.Thread.run(Thread.java:701) ~[na:1.6.0_27]
>>>>>
>>>>> It's fine.  I can rebuild a new cluster.  Storm-deploy makes it pretty
>>>>> easy.
>>>>>
>>>>> Thanks for your help on this!
>>>>>
>>>>> As for my other question.
>>>>>
>>>>> If my trident batch interval is 500ms and I keep the spout pending and
>>>>> batch size small enough, will I be able to get real-time results (i.e. sub-2
>>>>> seconds)?  I've played with the various settings (I literally have a
>>>>> spreadsheet of parameters to results) and haven't been able to get it.  Am
>>>>> I just doing it wrong?  What would the key parameters be?  The complete
>>>>> latency is 500 ms but trident seems to be way behind despite none of my
>>>>> bolts having a capacity > 0.6.  This may have to do with nimbus being
>>>>> throttled, so I will report back.  But if there are people out there who
>>>>> have done this kind of thing, I'd like to know if I'm missing an obvious
>>>>> parameter or something.
>>>>>
>>>>> Thanks,
>>>>> S
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Mar 2, 2014 at 8:09 PM, Michael Rose <mi...@fullcontact.com>wrote:
>>>>>
>>>>>> The fact that the process is being killed constantly is a red flag.
>>>>>> Also, why are you running it as a client VM?
>>>>>>
>>>>>> Check your nimbus.log to see why it's restarting.
>>>>>>
>>>>>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>>>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>>>>> michael@fullcontact.com
>>>>>>
>>>>>>
>>>>>> On Sun, Mar 2, 2014 at 7:50 PM, Sean Solbak <se...@solbak.ca> wrote:
>>>>>>
>>>>>>>   uintx ErgoHeapSizeLimit                         = 0
>>>>>>> {product}
>>>>>>>     uintx InitialHeapSize                          := 27080896
>>>>>>>  {product}
>>>>>>>     uintx LargePageHeapSizeThreshold                = 134217728
>>>>>>>   {product}
>>>>>>>     uintx MaxHeapSize                              := 698351616
>>>>>>>   {product}
>>>>>>>
>>>>>>>
>>>>>>> So the initial heap size is ~25 MB and the max is ~666 MB.
>>>>>>>
>>>>>>> It's a client process (not server, i.e. the command is "java -client
>>>>>>> -Dstorm.options...").  The process gets killed and restarted continuously
>>>>>>> with a new PID (which makes it tough to grab the PID to get stats on).  I don't
>>>>>>> have VisualVM, but if I run
>>>>>>>
>>>>>>> jstat -gc PID, I get
>>>>>>>
>>>>>>>  S0C    S1C    S0U    S1U      EC       EU        OC         OU
>>>>>>>   PC     PU    YGC     YGCT    FGC    FGCT     GCT
>>>>>>> 832.0  832.0   0.0   352.9   7168.0   1115.9   17664.0     1796.0
>>>>>>> 21248.0 16029.6      5    0.268   0      0.000    0.268
>>>>>>>
>>>>>>> At this point I'll likely just rebuild the cluster.  It's not in prod
>>>>>>> yet, as I still need to tune it.  I should have written 2 separate emails :)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> S
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Mar 2, 2014 at 7:10 PM, Michael Rose <
>>>>>>> michael@fullcontact.com> wrote:
>>>>>>>
>>>>>>>> I'm not seeing too much to substantiate that. What size heap are
>>>>>>>> you running, and is it near filled? Perhaps attach VisualVM and check for
>>>>>>>> GC activity.
>>>>>>>>
>>>>>>>>  Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>>>>>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>>>>>>> michael@fullcontact.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Mar 2, 2014 at 6:54 PM, Sean Solbak <se...@solbak.ca> wrote:
>>>>>>>>
>>>>>>>>> Here it is.  Appears to be some kind of race condition.
>>>>>>>>>
>>>>>>>>> http://pastebin.com/dANT8SQR
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Mar 2, 2014 at 6:42 PM, Michael Rose <
>>>>>>>>> michael@fullcontact.com> wrote:
>>>>>>>>>
>>>>>>>>>> Can you do a thread dump and pastebin it? It's a nice first step
>>>>>>>>>> to figure this out.
>>>>>>>>>>
>>>>>>>>>> I just checked on our Nimbus and while it's on a larger machine,
>>>>>>>>>> it's using <1% CPU. Also look in your logs for any clues.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>>>>>>>>> Senior Platform Engineer, FullContact<http://www.fullcontact.com/>
>>>>>>>>>> michael@fullcontact.com
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sun, Mar 2, 2014 at 6:31 PM, Sean Solbak <se...@solbak.ca>wrote:
>>>>>>>>>>
>>>>>>>>>>> No, they are on separate machines.  It's a 4-machine cluster - 2
>>>>>>>>>>> workers, 1 nimbus and 1 zookeeper.
>>>>>>>>>>>
>>>>>>>>>>> I suppose I could just create a new cluster, but I'd like to know
>>>>>>>>>>> why this is occurring to avoid future production outages.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> S
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Mar 2, 2014 at 6:19 PM, Michael Rose <
>>>>>>>>>>> michael@fullcontact.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Are you running Zookeeper on the same machine as the Nimbus box?
>>>>>>>>>>>>
>>>>>>>>>>>>  Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>>>>>>>>>>> Senior Platform Engineer, FullContact<http://www.fullcontact.com/>
>>>>>>>>>>>> michael@fullcontact.com
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Mar 2, 2014 at 6:16 PM, Sean Solbak <se...@solbak.ca>wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> This is the first step of 4. When I save to db I'm actually
>>>>>>>>>>>>> saving to a queue (just using the db for now).  In the 2nd step we index the
>>>>>>>>>>>>> data, and in the 3rd we do aggregation/counts for reporting.  The last is a
>>>>>>>>>>>>> search that I'm planning on using drpc for.  Within step 2 we pipe certain
>>>>>>>>>>>>> datasets in real time to the clients they apply to.  I'd like this and the
>>>>>>>>>>>>> drpc to be sub-2 s, which should be reasonable.
>>>>>>>>>>>>>
>>>>>>>>>>>>> You're right that I could speed up step 1 by not using trident,
>>>>>>>>>>>>> but our requirements seem like a good use case for the other 3 steps.  With
>>>>>>>>>>>>> many results per second, batching shouldn't affect performance a ton as long
>>>>>>>>>>>>> as the batch size is small enough.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What would cause nimbus to be at 100% CPU with the topologies
>>>>>>>>>>>>> killed?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mar 2, 2014, at 5:46 PM, Sean Allen <
>>>>>>>>>>>>> sean@monkeysnatchbanana.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there a reason you are using trident?
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you don't need to handle the events as a batch, you are
>>>>>>>>>>>>> probably going to get better performance w/o it.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, Mar 2, 2014 at 2:23 PM, Sean Solbak <se...@solbak.ca>wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Im writing a fairly basic trident topology as follows:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - 4 spouts of events
>>>>>>>>>>>>>> - merges into one stream
>>>>>>>>>>>>>> - serializes the object as an event in a string
>>>>>>>>>>>>>> - saves to db
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I split the serialization task away from the spout as it was
>>>>>>>>>>>>>> cpu intensive to speed it up.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The problem I have is that after 10 minutes there is over
>>>>>>>>>>>>>> 910k tuples emitted/transfered but only 193k records are saved.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The overall load of the topology seems fine.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - 536.404 ms complete latency at the topolgy level
>>>>>>>>>>>>>> - The highest capacity of any bolt is 0.3 which is the
>>>>>>>>>>>>>> serialization one.
>>>>>>>>>>>>>> - each bolt task has sub 20 ms execute latency and sub 40 ms
>>>>>>>>>>>>>> process latency.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So it seems trident has all the records internally, but I
>>>>>>>>>>>>>> need these events as close to realtime as possible.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Does anyone have any guidance as to how to increase the
>>>>>>>>>>>>>> throughput?  Is it simply a matter of tweeking max spout pending and the
>>>>>>>>>>>>>> batch size?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Im running it on 2 m1-smalls for now.  I dont see the need to
>>>>>>>>>>>>>> upgrade it until the demand on the boxes seems higher.  Although CPU usage
>>>>>>>>>>>>>> on the nimbus box is pinned.  Its at like 99%.  Why would that be?  Its at
>>>>>>>>>>>>>> 99% even when all the topologies are killed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We are currently targeting processing 200 million records per
>>>>>>>>>>>>>> day which seems like it should be quite easy based on what Ive read that
>>>>>>>>>>>>>> other people have achieved.  I realize that hardware should be able to
>>>>>>>>>>>>>> boost this as well but my first goal is to get trident to push the records
>>>>>>>>>>>>>> to the db quicker.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>>>>> Sean
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ce n'est pas une signature
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Sean Solbak, BsC, MCSD
>>>>>>>>>>> Solbak Technologies Inc.
>>>>>>>>>>> 780.893.7326 (m)
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Sean Solbak, BsC, MCSD
>>>>>>>>> Solbak Technologies Inc.
>>>>>>>>> 780.893.7326 (m)
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Sean Solbak, BsC, MCSD
>>>>>>> Solbak Technologies Inc.
>>>>>>> 780.893.7326 (m)
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks,
>>>>>
>>>>> Sean Solbak, BsC, MCSD
>>>>> Solbak Technologies Inc.
>>>>> 780.893.7326 (m)
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Thanks,
>>>>
>>>> Sean Solbak, BsC, MCSD
>>>> Solbak Technologies Inc.
>>>> 780.893.7326 (m)
>>>>
>>>
>>>
>>
>


-- 
Thanks,

Sean Solbak, BsC, MCSD
Solbak Technologies Inc.
780.893.7326 (m)

Re: Tuning and nimbus at 99%

Posted by Otis Gospodnetic <ot...@gmail.com>.
Another possibility: sudo grep -i kill /var/log/messages*
See
http://www.oracle.com/technetwork/articles/servers-storage-dev/oom-killer-1911807.html

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


Re: Tuning and nimbus at 99%

Posted by Michael Rose <mi...@fullcontact.com>.
Otis,

I'm a fan of SPM for Storm, but there's other debugging that needs to be
done here if the process quits constantly.

Sean,

Since you're using storm-deploy, I assume the processes are running under
supervisor. It might be worth killing the supervisor by hand, then running
it yourself (ssh as storm, cd storm/daemon, supervise .) and seeing what
kind of errors you see.

Are your disks perhaps filled?

Michael Rose (@Xorlev <https://twitter.com/xorlev>)
Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
michael@fullcontact.com


Re: Tuning and nimbus at 99%

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi Sean,

I don't think you can see the metrics you need to see with AWS CloudWatch.
 Have a look at SPM for Storm.  You can share graphs from SPM directly if
you want, so you don't have to grab and attach screenshots manually. See:
http://blog.sematext.com/2014/01/30/announcement-apache-storm-monitoring-in-spm/+
http://sematext.com/spm/

My bet is that you'll see GC metrics spikes....

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


Re: Tuning and nimbus at 99%

Posted by Sean Solbak <se...@solbak.ca>.
I just created a brand new cluster with the storm-deploy command:

lein deploy-storm --start --name storm-dev --commit
1bcc169f5096e03a4ae117efc65c0f9bcfa2fa22

I had a meeting, did nothing to the box, and no topologies were run.  I came
back 2 hours later and nimbus was at 100% CPU.

I'm running on an m1.small on the following AMI: ami-58a3cf68.  I'm unable
to get a thread dump as the process is getting killed and restarted too
fast.  I did attach a 3-hour snapshot of the EC2 monitors.  Any guidance
would be much appreciated.

Thanks,
S







-- 
Thanks,

Sean Solbak, BsC, MCSD
Solbak Technologies Inc.
780.893.7326 (m)

Re: Tuning and nimbus at 99%

Posted by Sean Solbak <se...@solbak.ca>.
The only error in the logs, which happened over 10 days ago, was:

2014-02-22 01:41:27 b.s.d.nimbus [ERROR] Error when processing event
java.io.IOException: Unable to delete directory
/mnt/storm/nimbus/stormdist/test-25-1393022928.
        at
org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:981)
~[commons-io-1.4.jar:1.4]
        at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:1381)
~[commons-io-1.4.jar:1.4]
        at backtype.storm.util$rmr.invoke(util.clj:442)
~[storm-core-0.9.0.1.jar:na]
        at backtype.storm.daemon.nimbus$do_cleanup.invoke(nimbus.clj:819)
~[storm-core-0.9.0.1.jar:na]
        at
backtype.storm.daemon.nimbus$fn__5528$exec_fn__1229__auto____5529$fn__5534.invoke(nimbus.clj:896)
~[storm-core-0.9.0.1.jar:na]
        at
backtype.storm.timer$schedule_recurring$this__3019.invoke(timer.clj:77)
~[storm-core-0.9.0.1.jar:na]
        at
backtype.storm.timer$mk_timer$fn__3002$fn__3003.invoke(timer.clj:33)
~[storm-core-0.9.0.1.jar:na]
        at backtype.storm.timer$mk_timer$fn__3002.invoke(timer.clj:26)
~[storm-core-0.9.0.1.jar:na]
        at clojure.lang.AFn.run(AFn.java:24) ~[clojure-1.4.0.jar:na]
        at java.lang.Thread.run(Thread.java:701) ~[na:1.6.0_27]

It's fine.  I can rebuild a new cluster.  Storm-deploy makes it pretty easy.

Thanks for your help on this!

As for my other question:

If my trident batch interval is 500 ms and I keep the max spout pending and
batch size small enough, will I be able to get real-time results (i.e. sub-2
seconds)?  I've played with the various settings (I literally have a
spreadsheet of parameters to results) and haven't been able to get it.  Am
I just doing it wrong?  What would the key parameters be?  The complete
latency is 500 ms, but trident seems to be way behind despite none of my
bolts having a capacity > 0.6.  This may have to do with nimbus being
throttled, so I will report back.  But if there are people out there who
have done this kind of thing, I'd like to know if I'm missing an obvious
parameter or something.
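
For reference, here is a minimal sketch of where those three knobs usually
live in a Trident setup (Java, against Storm 0.9.x).  FixedBatchSpout stands
in for the real spouts, and every number below is only an illustration, not a
recommendation:

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentTopology;
import storm.trident.operation.builtin.Debug;
import storm.trident.testing.FixedBatchSpout;

public class TuningSketch {
    public static void main(String[] args) throws Exception {
        // Batch size is a property of the spout itself; 100 here is the
        // maximum number of tuples emitted per batch.
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("event"), 100,
                new Values("e1"), new Values("e2"));
        spout.setCycle(true);

        TridentTopology topology = new TridentTopology();
        topology.newStream("events", spout)
                .each(new Fields("event"), new Debug());  // serialize/save steps omitted

        Config conf = new Config();
        // In Trident this counts whole batches in flight, not individual tuples.
        conf.setMaxSpoutPending(2);
        // How often Trident tries to emit the next batch.
        conf.put("topology.trident.batch.emit.interval.millis", 500);

        StormSubmitter.submitTopology("event-pipeline", conf, topology.build());
    }
}

With small batches, the end-to-end latency is roughly the batch emit interval
plus the per-batch processing time, so keeping the persist step fast matters
as much as the spout settings.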

Thanks,
S





-- 
Thanks,

Sean Solbak, BsC, MCSD
Solbak Technologies Inc.
780.893.7326 (m)

Re: Tuning and nimbus at 99%

Posted by Michael Rose <mi...@fullcontact.com>.
The fact that the process is being killed constantly is a red flag. Also,
why are you running it as a client VM?

Check your nimbus.log to see why it's restarting.

Michael Rose (@Xorlev <https://twitter.com/xorlev>)
Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
michael@fullcontact.com



Re: Tuning and nimbus at 99%

Posted by Sean Solbak <se...@solbak.ca>.
  uintx ErgoHeapSizeLimit            = 0            {product}
  uintx InitialHeapSize             := 27080896     {product}
  uintx LargePageHeapSizeThreshold   = 134217728    {product}
  uintx MaxHeapSize                 := 698351616    {product}


So the initial heap size is ~25 MB and the max is ~666 MB.

It's a client process (not server, i.e. the command is "java -client
-Dstorm.options...").  The process gets killed and restarted continuously
with a new PID (which makes it tough to grab the PID and get stats on it).
I don't have VisualVM, but if I run

jstat -gc PID, I get

 S0C    S1C    S0U    S1U      EC       EU        OC         OU       PC      PU     YGC    YGCT    FGC    FGCT     GCT
832.0  832.0   0.0   352.9   7168.0   1115.9   17664.0     1796.0   21248.0 16029.6     5    0.268     0    0.000    0.268

At this point I'll likely just rebuild the cluster.  It's not in prod yet as
I still need to tune it.  I should have written 2 separate emails :)

Thanks,
S






-- 
Thanks,

Sean Solbak, BsC, MCSD
Solbak Technologies Inc.
780.893.7326 (m)

Re: Tuning and nimbus at 99%

Posted by Michael Rose <mi...@fullcontact.com>.
I'm not seeing too much to substantiate that. What size heap are you
running, and is it nearly full? Perhaps attach VisualVM and check for GC
activity.

Michael Rose (@Xorlev <https://twitter.com/xorlev>)
Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
michael@fullcontact.com



Re: Tuning and nimbus at 99%

Posted by Sean Solbak <se...@solbak.ca>.
Here it is.  It appears to be some kind of race condition.

http://pastebin.com/dANT8SQR




-- 
Thanks,

Sean Solbak, BsC, MCSD
Solbak Technologies Inc.
780.893.7326 (m)

Re: Tuning and nimbus at 99%

Posted by Michael Rose <mi...@fullcontact.com>.
Can you do a thread dump and pastebin it? It's a nice first step to figure
this out.

I just checked on our Nimbus and while it's on a larger machine, it's using
<1% CPU. Also look in your logs for any clues.


Michael Rose (@Xorlev <https://twitter.com/xorlev>)
Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
michael@fullcontact.com



Re: Tuning and nimbus at 99%

Posted by Sean Solbak <se...@solbak.ca>.
No, they are on separate machines.  It's a 4-machine cluster - 2 workers, 1
nimbus and 1 zookeeper.

I suppose I could just create a new cluster, but I'd like to know why this is
occurring to avoid future production outages.

Thanks,
S





-- 
Thanks,

Sean Solbak, BsC, MCSD
Solbak Technologies Inc.
780.893.7326 (m)

Re: Tuning and nimbus at 99%

Posted by Michael Rose <mi...@fullcontact.com>.
Are you running Zookeeper on the same machine as the Nimbus box?

Michael Rose (@Xorlev <https://twitter.com/xorlev>)
Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
michael@fullcontact.com



Re: Tuning and nimbus at 99%

Posted by Sean Solbak <se...@solbak.ca>.
This is the first step of 4. When I save to db I'm actually saving to a queue (just using the db for now).  In the 2nd step we index the data, and in the 3rd we do aggregation/counts for reporting.  The last is a search that I'm planning on using drpc for.  Within step 2 we pipe certain datasets in real time to the clients they apply to.  I'd like this and the drpc to be sub-2s, which should be reasonable.
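
For the search step, a minimal sketch of what a Trident DRPC query can look
like (Java, against Storm 0.9.x); the in-memory state and the local cluster
are only stand-ins for whatever will actually back the index:

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.LocalDRPC;
import backtype.storm.tuple.Fields;
import storm.trident.TridentState;
import storm.trident.TridentTopology;
import storm.trident.operation.builtin.MapGet;
import storm.trident.testing.MemoryMapState;

public class SearchDrpcSketch {
    public static void main(String[] args) throws Exception {
        LocalDRPC drpc = new LocalDRPC();
        TridentTopology topology = new TridentTopology();

        // Placeholder state; in the real pipeline this would be the indexed data.
        TridentState index = topology.newStaticState(new MemoryMapState.Factory());

        // "search" is the DRPC function name that callers invoke.
        topology.newDRPCStream("search", drpc)
                .stateQuery(index, new Fields("args"), new MapGet(), new Fields("result"));

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("search-drpc", new Config(), topology.build());

        Thread.sleep(2000);  // give the local cluster a moment to come up
        // The caller blocks until the query result comes back.
        System.out.println(drpc.execute("search", "some-query"));

        cluster.shutdown();
        drpc.shutdown();
    }
}

Against a real cluster the call would go through a DRPCClient instead of the
LocalDRPC stub.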

You're right that I could speed up step 1 by not using trident, but our requirements seem like a good use case for the other 3 steps.  With many results per second, batching should affect performance a ton if the batch size is small enough.

What would cause nimbus to be at 100% CPU with the topologies killed? 

Sent from my iPhone


Re: Tuning and nimbus at 99%

Posted by Sean Allen <se...@monkeysnatchbanana.com>.
Is there a reason you are using trident?

If you don't need to handle the events as a batch, you are probably going
to get better performance w/o it.
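
For comparison, a rough sketch of the same pipeline as a plain (non-Trident)
topology; EventSpout, SerializeBolt and SaveBolt are hypothetical stand-ins
for the existing components, and the parallelism numbers are arbitrary:

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class PlainTopologySketch {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("events", new EventSpout(), 4);      // hypothetical spout
        builder.setBolt("serialize", new SerializeBolt(), 4)  // hypothetical bolt
               .shuffleGrouping("events");
        builder.setBolt("save", new SaveBolt(), 2)            // hypothetical bolt
               .shuffleGrouping("serialize");

        Config conf = new Config();
        conf.setMaxSpoutPending(1000);  // counted per tuple here, not per batch
        StormSubmitter.submitTopology("event-pipeline-plain", conf,
                builder.createTopology());
    }
}

Each tuple is acked individually, so there is no batch boundary to wait on
before the save step sees an event.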




-- 

Ce n'est pas une signature