Posted to user@spark.apache.org by Aravindh <ma...@aravindh.io> on 2016/09/28 04:09:04 UTC

Help required in validating an architecture using Structured Streaming

Hi, we are building an internal analytics application, kind of an event
store. We have all the basic analytics use cases: filtering, aggregation,
segmentation, etc. So far our architecture has relied extensively on
Elasticsearch, but that is not scaling anymore. One unique requirement we
have is that an event should be available for querying within 5 seconds of
its arrival. We were thinking of a lambda architecture where streaming data
still goes to Elasticsearch (only one day's data) and the batch pipeline
goes to S3. Once a day, a Spark job would transform that data and store it
again in S3. One problem we were not able to solve was how, when a query
comes in, to aggregate results from the two data sources (Elasticsearch for
current data and S3 for old data). We felt this approach won't scale. The
kind of cross-source aggregation I mean is sketched below.
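
A minimal sketch of that aggregation, assuming the elasticsearch-hadoop
connector is on the classpath; the index name, bucket path, and event_type
column are placeholders, not our real setup:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("FederatedQuery").getOrCreate()

// Today's events, read from Elasticsearch via the es-hadoop connector.
val recent = spark.read
  .format("org.elasticsearch.spark.sql")
  .load("events-today")

// Historical events, stored as Parquet on S3.
val historical = spark.read.parquet("s3a://our-bucket/events/")

// Union the two sources (assumes identical schemas and column order)
// and aggregate them as a single dataset.
val counts = recent.union(historical)
  .groupBy("event_type")
  .count()

counts.show()

This works functionally, but running it for every interactive query is
exactly what we doubt will meet our latency requirement.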

Spark Structured Streaming seems to solve this; correct me if I am wrong.
With Structured Streaming, would the following architecture work? Read data
from Kafka using Spark; for every micro-batch of data, apply the
transformations and store the result in S3; when a query comes in, query
both S3 and the in-memory batch at the same time. Will this approach work?
One more constraint is that queries should respond immediately, with a
maximum latency of 1s for simple queries and 5s for complex ones. If the
above is not the right way, please suggest an alternative. The sketch
below shows what I have in mind.
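
Here is a rough sketch of the Structured Streaming idea, with placeholder
broker, topic, and bucket names; the memory sink is the built-in way I know
of to keep the latest micro-batches queryable from SQL:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("EventStore").getOrCreate()

// 1. Read the event stream from Kafka.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS raw")
  // ... per-batch transformations would go here ...

// 2. Persist every micro-batch to S3 as Parquet.
events.writeStream
  .format("parquet")
  .option("path", "s3a://our-bucket/events/")
  .option("checkpointLocation", "s3a://our-bucket/checkpoints/")
  .start()

// 3. Keep recent data queryable in memory (each start() creates an
// independent streaming query reading from Kafka).
events.writeStream
  .format("memory")
  .queryName("recent_events")
  .outputMode("append")
  .start()

// 4. At query time, union the in-memory table with the S3 data.
val recent = spark.sql("SELECT * FROM recent_events")
val old = spark.read.parquet("s3a://our-bucket/events/")
val all = recent.union(old) // assumes identical schemas

(I realize the memory sink keeps everything on the driver, so this is only
to illustrate the idea, not something I would run as-is.)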

Thanks
Aravindh.S


