Posted to user@storm.apache.org by Daniel Butnaru <db...@danielbutnaru.de> on 2015/05/09 15:05:37 UTC

Misuse of Storm?


I am currently using Storm in an unorthodox way. While I am happy with the initial prototype, I would like to know your opinion on potential problems I am overlooking.

In our company we need to execute a cascade of image processing steps (bolts) on relatively large input data. A single input file (and thus a single tuple) can range between 100 MB and 100 GB. With input of that size, the data itself is of course not placed in the stream, only a reference to it (e.g., the file name). This is the first non-conformity with the intent of Storm. The second is the long execution time of a single bolt (20-30 min), due mainly to the large size of an individual compute unit (the file).
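To make this concrete, the bolts essentially look like the sketch below: the tuple carries only a path, the bolt does its long-running work locally, and it acks once the step has finished. This is a simplified illustration; the class, field, and method names are made up, and the actual processing is just a placeholder.

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    import java.io.File;
    import java.util.Map;

    public class ImageStepBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple tuple) {
            // The stream only carries a reference to the data, never the data itself.
            String inputPath = tuple.getStringByField("path");

            // Long-running processing step (20-30 min for a large file);
            // only the resulting file's path is passed downstream.
            String outputPath = processImage(new File(inputPath));

            collector.emit(tuple, new Values(outputPath));
            collector.ack(tuple); // ack only once the step has completed
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("path"));
        }

        private String processImage(File input) {
            // placeholder for one step of the image-processing cascade
            return input.getAbsolutePath() + ".out";
        }
    }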
By adjusting the bolt heartbeat timeout and the expected ACK timeout in the spout, I can convince Storm to process the files. Of course, I need to take care of data locality, which the spout must be aware of: on each worker (one per machine) a spout fires only tuples whose input files are local. So in a sense, I mimic the functionality of HDFS.
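The timeout tuning boils down to something like the following sketch. The concrete numbers are just examples, and whether the cluster-side liveness timeouts really need raising is an assumption on my part rather than a confirmed requirement.

    import backtype.storm.Config;

    public class LargeFileTopologyConfig {
        public static Config build() {
            Config conf = new Config();
            // The tuple tree for one file must be fully acked within this
            // window, so it has to cover the whole cascade for the largest file.
            conf.setMessageTimeoutSecs(4 * 60 * 60); // e.g. 4 hours, an assumed value
            // Keep only one huge file in flight per spout task at a time.
            conf.setMaxSpoutPending(1);
            return conf;
        }
        // On the cluster side (storm.yaml) the liveness timeouts, e.g.
        // supervisor.worker.timeout.secs and nimbus.task.timeout.secs, are
        // raised as well, since the defaults assume short-running tasks.
    }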
At some point the input files will decrease in size and increase in number, but until the algorithm developers write their code with parallel distribution in mind, this Storm solution with large input files will have to do.
Or does it? The advantage of this solution is that it scales down well to a single-machine environment and scales up nicely (as long as I keep an eye on data locality). The disadvantage is that I am essentially using Storm "just" as a real-time job pipeline scheduler with a relatively small number of input items (50-1000).
Are there better solutions for this specific setup, or should I just be happy with what works? With this "misuse", will I run into unexpected problems?
Thanks for reading so much.
- Daniel