Posted to user@storm.apache.org by Raphael Hsieh <ra...@gmail.com> on 2014/06/02 22:25:04 UTC

Batch size never seems to increase

I'm trying to figure out why my batch size never seems to get any bigger
than 83K tuples.
It could simply be the throughput of my spout, but I don't believe that's
the case, because the spout appears to be backing up (I'm not processing
the tuples as quickly as they are being produced).

Currently I'm just using a barebones topology that looks like this:

Stream spout = topology.newStream("...", ...)
   .parallelismHint(x)
   .groupBy(new Fields("time"))
   .aggregate(new Count(), new Fields("count"))
   .parallelismHint(x)
   .each(new Fields("time", "count"), new PrintFilter());

All the stream is doing is aggregating on like timestamps and printing out
the count.

In my config I've set the batch size to 10 MB like so:
Config config = new Config();
config.put(RichSpoutBatchExecutor.MAX_BATCH_SIZE_CONF, 1024 * 1024 * 10);

When I set the batch size to 5 MB or even 1 MB there is no difference;
everything always adds up to roughly 83K tuples.

To count how many tuples are in a batch, I look at the system timestamp
at which things are printed out (in the print filter) and add up the
count values of all the print statements that share the same timestamp.
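The counting step above amounts to something like this (a standalone
sketch; the BatchCounter class and the sample values are hypothetical,
just to illustrate summing counts that share a print timestamp):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BatchCounter {
    // Each printed line carries (systemTimestamp, count); summing the
    // counts that share a system timestamp gives the per-batch total.
    public static Map<Long, Long> sumByTimestamp(long[][] printed) {
        Map<Long, Long> totals = new LinkedHashMap<>();
        for (long[] entry : printed) {
            totals.merge(entry[0], entry[1], Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Three print statements with timestamp 1000 belong to one batch.
        long[][] printed = { {1000, 40000}, {1000, 30000}, {1000, 13000} };
        System.out.println(sumByTimestamp(printed)); // {1000=83000}
    }
}
```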

When I compare the system timestamp of when the batch was processed
against the tuple timestamps (the ones they were aggregated on), I am
falling behind. This leads me to believe that the spout is emitting more
tuples than I am processing, so there should be more than 83K tuples per
batch.

If anyone has insight into this, it would be greatly appreciated.
Thanks!
-- 
Raphael Hsieh