Posted to user@storm.apache.org by Shaikh Riyaz <sh...@gmail.com> on 2014/06/14 16:53:39 UTC

Help in processing huge data through Kafka-Storm cluster

Hi,

We download 28 million messages daily, and monthly the total goes up to
800+ million.

We want to process this volume of data through our Kafka and Storm clusters
and store the results in an HBase cluster.

We are targeting processing one month of data in one day. Is that possible?
(800 million messages in 86,400 seconds works out to a sustained rate of
roughly 9,000-10,000 messages per second.)

We set up our cluster expecting to process a million messages per second, as
mentioned on the web. Unfortunately, we have ended up processing only
1,200-1,700 messages per second. If we continue at this speed, it will take
at least 10 days to process 30 days of data, which is not an acceptable
solution in our case.

I suspect that we have to change some configuration to achieve this goal,
and I am looking for help from experts in achieving it.

*Kafka Cluster:*
Kafka is running on two dedicated machines with 48 GB of RAM and 2 TB of
storage. In total we have an 11-node Kafka cluster spread across these two
servers.

*Kafka Configuration:*
producer.type=async
compression.codec=none
request.required.acks=-1
serializer.class=kafka.serializer.StringEncoder
queue.buffering.max.ms=100000
batch.num.messages=10000
queue.buffering.max.messages=100000
default.replication.factor=3
controlled.shutdown.enable=true
auto.leader.rebalance.enable=true
num.network.threads=2
num.io.threads=8
num.partitions=4
log.retention.hours=12
log.segment.bytes=536870912
log.retention.check.interval.ms=60000
log.cleaner.enable=false
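
For context, the first seven properties above are producer settings and the
rest are broker settings (server.properties). A minimal sketch of how the
producer side is wired with the old 0.8 API (the broker list, topic name,
and message variable below are placeholders):

import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

Properties props = new Properties();
props.put("metadata.broker.list", "kafka1:9092,kafka2:9092"); // placeholder brokers
props.put("producer.type", "async");
props.put("compression.codec", "none");
props.put("request.required.acks", "-1");
props.put("serializer.class", "kafka.serializer.StringEncoder");
props.put("queue.buffering.max.ms", "100000");
props.put("batch.num.messages", "10000");
props.put("queue.buffering.max.messages", "100000");

// one producer instance shared by the download threads
Producer<String, String> producer =
    new Producer<String, String>(new ProducerConfig(props));
String message = "...";  // placeholder payload
producer.send(new KeyedMessage<String, String>("our-topic", message));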

*Storm Cluster:*
Storm is running with 5 supervisors and 1 nimbus on IBM servers with 48 GB
of RAM and 8 TB of storage. These servers are shared with the HBase cluster.

*Kafka spout configuration*
kafkaConfig.bufferSizeBytes = 1024*1024*8;
kafkaConfig.fetchSizeBytes = 1024*1024*4;
kafkaConfig.forceFromStart = true;
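
The spout itself is built roughly like this with storm-kafka (the ZooKeeper
connect string, topic, zkRoot, and spout id below are placeholders):

import storm.kafka.BrokerHosts;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.ZkHosts;

BrokerHosts hosts = new ZkHosts("zk1:2181");  // placeholder ZooKeeper connect string
SpoutConfig kafkaConfig = new SpoutConfig(hosts, "our-topic", "/kafkastorm", "our-spout-id");
kafkaConfig.bufferSizeBytes = 1024 * 1024 * 8;
kafkaConfig.fetchSizeBytes = 1024 * 1024 * 4;
kafkaConfig.forceFromStart = true;  // start from the beginning of the topic
KafkaSpout kafkaSpout = new KafkaSpout(kafkaConfig);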

*Topology: StormTopology*
Spout       - partitions: 4
First Bolt  - parallelism hint: 6, num tasks: 5
Second Bolt - parallelism hint: 5
Third Bolt  - parallelism hint: 3
Fourth Bolt - parallelism hint: 3, num tasks: 4
Fifth Bolt  - parallelism hint: 3
Sixth Bolt  - parallelism hint: 3
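
Wiring, roughly (the bolt class names and groupings below are placeholders,
and only the first two bolts are shown; the rest follow the same pattern):

import backtype.storm.topology.TopologyBuilder;

TopologyBuilder builder = new TopologyBuilder();
// 4 spout executors to match the 4 Kafka partitions
builder.setSpout("kafka-spout", kafkaSpout, 4);
builder.setBolt("first-bolt", new FirstBolt(), 6).setNumTasks(5)
       .shuffleGrouping("kafka-spout");
builder.setBolt("second-bolt", new SecondBolt(), 5)
       .shuffleGrouping("first-bolt");
// ...remaining bolts wired the same way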

*Supervisor configuration:*

storm.local.dir: "/app/storm"
storm.zookeeper.port: 2181
storm.cluster.mode: "distributed"
storm.local.mode.zmq: false
supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703
supervisor.worker.start.timeout.secs: 180
supervisor.worker.timeout.secs: 30
supervisor.monitor.frequency.secs: 3
supervisor.heartbeat.frequency.secs: 5
supervisor.enable: true

storm.messaging.netty.server_worker_threads: 2
storm.messaging.netty.client_worker_threads: 2
storm.messaging.netty.buffer_size: 52428800 #50MB buffer
storm.messaging.netty.max_retries: 25
storm.messaging.netty.max_wait_ms: 1000
storm.messaging.netty.min_wait_ms: 100


supervisor.childopts: "-Xmx1024m -Djava.net.preferIPv4Stack=true"
worker.childopts: "-Xmx2048m -Djava.net.preferIPv4Stack=true"


Please let me know if more information is needed.

Thanks in advance.

-- 
Regards,

Riyaz

Re: Help in processing huge data through Kafka-Storm cluster

Posted by Neelesh <ne...@gmail.com>.
FWIW, setting the number of ackers to the number of workers gave us an
order-of-magnitude gain in latency on our small EC2 test cluster. Our next
step is to try simple micro-batching/Trident and see how that impacts latency.
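
For anyone trying the same, it is just this on the topology Config (the
worker count below is an example value; we size ackers to match workers):

import backtype.storm.Config;

Config conf = new Config();
conf.setNumWorkers(4);  // example worker count
conf.setNumAckers(4);   // one acker per worker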



Re: Help in processing huge data through Kafka-Storm cluster

Posted by Haralds Ulmanis <ha...@evilezh.net>.
And what about CPU/network/disk utilization? And load factors per bolt from
the Storm UI?

