Posted to user@storm.apache.org by Victor Godoy Poluceno <vi...@gmail.com> on 2015/03/10 20:37:48 UTC

Storm 0.9.3 ShellBolt stops processing tuples after a while

Hi,

I have a log processing topology on Storm 0.9.3, with a Kafka spout and
bolts implemented as ShellBolts via the multilang protocol through Pyleus
(Python).

Everything runs fine at a normal pace (when there is no big lag to
consume). But with any offset lag higher than 100 messages per partition,
the topology stops processing/acking tuples after handling only a few,
around 200 in total.

The processing is very CPU bound, as I need to parse, roll up and
aggregate metrics before sending time series data to Cassandra. We are
also processing big tuples, around 0.8 MB each.

The topology has 1 spout and 2 bolts: one bolt that aggregates and
another that just commits. The two bolts are connected with a fields
grouping.
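
For context, the aggregate bolt is roughly shaped like the sketch below.
I've written it against Storm's stock storm.py multilang module rather
than Pyleus, and the field names and the aggregation itself are made up
for illustration; the real bolt does the CPU-heavy parse/roll-up before
emitting to the commit bolt.

import storm

class AggregateBolt(storm.BasicBolt):
    def process(self, tup):
        # tup.values holds the incoming tuple's values; values[0] is the
        # key that the fields grouping to the commit bolt hashes on
        key, events = tup.values[0], tup.values[1]
        # stand-in for the real parse/roll-up work
        aggregated = len(events)
        # BasicBolt anchors the emit to the current tuple and acks it
        # automatically once process() returns
        storm.emit([key, aggregated])

AggregateBolt().run()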

My storm.yml settings:

worker.childopts: "-Xmx1G -Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=1%ID%
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false -Djava.net.preferIPv4Stack=true"
storm.messaging.netty.max_wait_ms: 3000
storm.messaging.netty.min_wait_ms: 300
storm.messaging.netty.max_retries: 90

storm.messaging.netty.buffer_size: 1048576 #1MB buffer
topology.receiver.buffer.size: 2
topology.transfer.buffer.size: 4
topology.executor.receive.buffer.size: 2
topology.executor.send.buffer.size: 2


I've reduced all the buffer configs because of the size of my tuples,
but that didn't show any improvement. The topology runs on 2 dedicated
servers with 8 cores and 16 GB of RAM each. I'm running 12 workers, with
a parallelism hint of 12 for the Kafka spout, 12 for the aggregate bolt,
and 12 for the commit bolt.

I've also set max spout pending to 100 and max shell bolt pending to
1000, but changing these numbers didn't seem to have any real impact.
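
For completeness, those two settings are, assuming I have the key names
right, the standard topology config keys, set in the same way as the
buffer options above:

topology.max.spout.pending: 100
topology.shellbolt.max.pending: 1000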

After a lot of investigation I've narrowed the problem down to emitting
tuples from the aggregate bolt to the commit bolt: if I simply don't emit
any tuples from the aggregate bolt, everything works fine. I don't know
whether this has anything to do with ShellBolt or with Pyleus itself.

This screenshot
<https://s3-sa-east-1.amazonaws.com/uploads-br.hipchat.com/114883/858008/SfP6b6YsX4k8jHo/screen.jpg>
shows the topology running during a 5-minute window with around 150
messages of lag on each Kafka partition (the topic has 12 partitions).

Any thoughts on why this is happening?

-- 
hooray!

--
Victor Godoy Poluceno

Re: Storm 0.9.3 ShellBolt stops processing tuples after a while

Posted by Victor Godoy Poluceno <vi...@gmail.com>.
I'm using Pyleus 0.2.4, but I've backported the Pyleus change [1] that
adds heartbeat support into my own Python code, so my bolts should be
responding to heartbeat commands.
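
For reference, the backported change boils down to answering the
heartbeat tuples that ShellBolt started sending in 0.9.3. Sketched below
with the stock storm.py helpers (the class and the placement of the check
are illustrative; the Pyleus port does the equivalent in its own read
loop):

import storm

class HeartbeatAwareBolt(storm.Bolt):
    def process(self, tup):
        # ShellBolt sends heartbeat tuples on the "__heartbeat__" stream
        # with task id -1; the subprocess must reply with a "sync"
        # message, otherwise the ShellBolt assumes it is hung.
        if tup.task == -1 and tup.stream == "__heartbeat__":
            storm.sync()  # writes {"command": "sync"} back to the JVM
            return
        # ... normal tuple handling would go here ...

HeartbeatAwareBolt().run()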

Here are logs from 2 of my 6 workers, plus the supervisor logs:

https://gist.github.com/victorpoluceno/0c296eac4f5954f7377c
https://gist.github.com/victorpoluceno/50dc1f1f7782ca4a86ce
https://gist.github.com/victorpoluceno/0020623046cd3807e802

Thank you for your time.

[1] -
https://github.com/Yelp/pyleus/commit/a929e425cd86d6f9cca87c6b5d37664184f19d99

On Wed, Mar 11, 2015 at 10:23 AM, 임정택 <ka...@gmail.com> wrote:

> What version of Pyleus are you using?
> AFAIK Pyleus doesn't have any released version that supports the
> heartbeat introduced in 0.9.3; it's only on the develop branch.
>
> Could you check your worker / ShellBolt subprocess logs, and paste them
> if you don't mind?
>


-- 
hooray!

--
Victor Godoy Poluceno

Re: Storm 0.9.3 ShellBolt stops processing tuples after a while

Posted by 임정택 <ka...@gmail.com>.
What version of Pyleus are you using?
AFAIK Pyleus doesn't have any released version that supports the
heartbeat introduced in 0.9.3; it's only on the develop branch.

Could you check your worker / ShellBolt subprocess logs, and paste them
if you don't mind?