Posted to user@storm.apache.org by Daniel Sachse <sa...@gmail.com> on 2015/06/03 13:04:33 UTC

Trident and "no-more-data" aggregation

Hey folks,

I am currently evaluating Trident as a replacement for our website analysis tool.
We currently have several components that do: crawling, analyzing, aggregation and reporting. They talk to each other via message queues.

I think that most of our current infrastructure code can be replaced by Storm's Trident, but there is one point where I am unsure whether this is possible:
When we crawl a website, we don't know in advance how many pages will be crawled. Once our Crawler no longer detects any new pages, it fires an aggregation event and we check, for example, whether all subpages have Google Analytics installed. We include several more metrics and send a report.
A simple flowchart: 1 Crawler produces X pages, Analyzer consumes 1 page and produces 1 result, Aggregator consumes X results and produces 1 report, Reporting consumes 1 report and produces 1 enriched report in Y formats.

The critical part is the migration of our aggregation system because, as far as I understand, aggregation in Trident is only possible in real time and not batch-wise. What I would like to know is whether there is a way to say: "Do the aggregation once there has not been any new data for 5 minutes or so".

Is this somehow achievable? Or do you see any other methods I could use? Or is this a wrong use-case for Trident?

Best regards,

Daniel



Re: Trident and "no-more-data" aggregation

Posted by Nikhil Singh <ns...@yahoo.com>.
Hi Daniel,

In Trident it is possible to do batch aggregations. If the spout emits X pages for a batch, the aggregation can happen on that batch.
In your example, the spout keeps emitting all X pages from a website as tuples of a single batch. Once there are no more pages to emit, the spout signals the completion of the batch.
For that batch you can then do the aggregations using a State and persist the values in any storage system. After that, the report can be generated.
-Nikhil 
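
For the "no new data for 5 minutes" variant from the original question, here is a minimal sketch in plain Java of the bookkeeping such an aggregator would need. It is not Storm or Trident API: the class IdleTimeoutAggregator, its methods and the crawlId key are hypothetical names chosen for illustration, and the caller is assumed to pass in the current time (in a topology this might come from a periodic timer, which is an assumption and not something described in this thread).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Storm or Trident.
class IdleTimeoutAggregator {

    // Running totals for one crawl, plus the time its last result arrived.
    static final class CrawlStats {
        int pages;
        int pagesWithAnalytics;
        long lastSeenMillis;
    }

    // Final report for one crawl, produced once the crawl has gone idle.
    static final class Report {
        final String crawlId;
        final int pages;
        final int pagesWithAnalytics;

        Report(String crawlId, int pages, int pagesWithAnalytics) {
            this.crawlId = crawlId;
            this.pages = pages;
            this.pagesWithAnalytics = pagesWithAnalytics;
        }

        boolean allPagesHaveAnalytics() {
            return pages == pagesWithAnalytics;
        }
    }

    private final long idleTimeoutMillis;
    private final Map<String, CrawlStats> openCrawls = new HashMap<>();

    IdleTimeoutAggregator(long idleTimeoutMillis) {
        this.idleTimeoutMillis = idleTimeoutMillis;
    }

    // Record one analyzer result and refresh the crawl's last-seen time.
    void addResult(String crawlId, boolean hasAnalytics, long nowMillis) {
        CrawlStats stats = openCrawls.computeIfAbsent(crawlId, k -> new CrawlStats());
        stats.pages++;
        if (hasAnalytics) {
            stats.pagesWithAnalytics++;
        }
        stats.lastSeenMillis = nowMillis;
    }

    // Emit and forget every crawl that has been idle for at least the timeout.
    List<Report> flushIdle(long nowMillis) {
        List<Report> reports = new ArrayList<>();
        Iterator<Map.Entry<String, CrawlStats>> it = openCrawls.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, CrawlStats> entry = it.next();
            CrawlStats stats = entry.getValue();
            if (nowMillis - stats.lastSeenMillis >= idleTimeoutMillis) {
                reports.add(new Report(entry.getKey(), stats.pages, stats.pagesWithAnalytics));
                it.remove();
            }
        }
        return reports;
    }
}
```

Calling addResult for every analyzed page and flushIdle at a regular interval would give one report per crawl once it has been quiet for the configured timeout. The state here lives in memory only, so a real deployment would still need to persist it somewhere.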

