Posted to user@spark.apache.org by Daniel Schulz <da...@hotmail.com> on 2016/01/26 22:25:34 UTC

Spark Pattern and Anti-Pattern

Hi,
We are currently working on a solution architecture for IoT workloads on Spark. I would therefore like to know whether it is considered an anti-pattern in Spark to read records from a database and make a REST call to an external server with that data. This external server may, and probably will, be the bottleneck -- but from Spark's point of view: is it potentially harmful to open connections and wait for their responses for vast numbers of rows?
In the same vein: is calling an external library (instead of making a REST call) for every row potentially problematic?
And how should a C++ library be embedded in this workflow: is it best to write a function that makes a JNI call to run it natively -- if (and only if) we know we are single-threaded? Or is there a better way to include C++ code in Spark jobs?
Many thanks in advance.
Kind regards, Daniel.

Re: Spark Pattern and Anti-Pattern

Posted by Jörn Franke <jo...@gmail.com>.

Spark has its best use cases in in-memory batch processing and machine learning. Connecting multiple different sources and destinations requires some thought, and probably more than just Spark.
Connecting Spark to a database makes sense only in very few cases. You will have huge performance issues due to the lack of data locality. You will also put unexpected load on the database in the case of speculative execution, crashing nodes, etc., because duplicated or retried tasks repeat their queries.
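
If a job does have to talk to an external system, one common way to limit the connection load mentioned above is to open one connection per partition instead of one per row, and to send rows in batches. The following is only a minimal sketch under stated assumptions: `open_client`, `send_batch` and `close` are hypothetical names for whatever client the external system provides, and the batch size is arbitrary. It does not protect against duplicate calls from speculative or retried tasks, so the external endpoint would still need to tolerate repeats.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator


def call_in_batches(
    rows: Iterable[dict],
    open_client: Callable[[], Any],
    batch_size: int = 500,
) -> Iterator[dict]:
    """Process one partition: open a single client and send rows in batches."""
    # Hypothetical factory: one connection for the whole partition.
    client = open_client()
    try:
        it = iter(rows)
        while True:
            batch = list(islice(it, batch_size))
            if not batch:
                break
            # `send_batch` is an assumed method that returns one result per row.
            yield from client.send_batch(batch)
    finally:
        # Runs once the partition is exhausted or the generator is discarded.
        client.close()


# In PySpark this could be wired in per partition, for example:
#   results = rdd.mapPartitions(lambda rows: call_in_batches(rows, open_client))
```

With this shape, the number of open connections is bounded by the number of concurrently running tasks rather than by the number of rows.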
Using REST to transfer a lot of data is, again, something to be careful with. A typical REST call cannot resume an interrupted transmission: if the transfer is interrupted after you have sent 1 TB, you have to transmit everything again. Also, although REST is format-agnostic, it is usually used with formats such as JSON or XML, which are highly inefficient for large files. It is better to use Avro or the like (for exchanges between systems, not for querying!). In exceptional cases (e.g. legacy systems), one or more well-designed CSV files are better. In any case, please use compression.
What does the REST service do with the data? Why don't you use sftp + rsync (or duplicity), which can resume interrupted file transfers?
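
To illustrate the point about compression, here is a small sketch that gzips a payload before it is sent. It uses newline-delimited JSON only because that needs nothing beyond the Python standard library; it is not the Avro encoding recommended above, which would require a third-party library. The record fields are made up for illustration.

```python
import gzip
import json


def encode_records(records):
    """Serialize records as newline-delimited JSON and gzip the result."""
    raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    return raw, gzip.compress(raw)


def decode_records(payload):
    """Inverse of encode_records for the compressed payload."""
    text = gzip.decompress(payload).decode("utf-8")
    return [json.loads(line) for line in text.splitlines()]


# Made-up sensor readings, purely for illustration.
records = [{"device": f"sensor-{i % 10}", "temp_c": 20 + i % 5} for i in range(10_000)]
raw, packed = encode_records(records)

# Repetitive data like this compresses to a fraction of its raw size.
print(len(raw), len(packed))
```

How much is saved depends on the data; highly repetitive text formats tend to benefit the most.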

I did not understand your last question. Generally, JNI is fine. However, you should carefully test the memory allocation of your JNI library, or run it in software containers such as Docker, or use cgroups to limit the memory and CPU usage of your JNI library.

However, it all depends on the details of your use case.


Re: Spark Pattern and Anti-Pattern

Posted by Lars Albertsson <la...@mapflat.com>.

Querying a service or a database from a Spark job is in most cases an
anti-pattern, but there are exceptions. Jobs that rely on a live
database become unstable and non-deterministic.

The recommended pattern is to take regular dumps of the database to
your cluster storage, e.g. HDFS, and join the dump dataset with other
datasets, e.g. your incoming events. There are good and bad ways to
dump, however. I covered the topic in this presentation, which you may
find useful: http://www.slideshare.net/lallea/functional-architectural-patterns,
https://vimeo.com/channels/flatmap2015/128468974.
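
The dump-and-join pattern described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the field names (`device_id`, `site`, `temp_c`) are hypothetical, and the dump is assumed to be small enough to hold in a dictionary. The point is that the job reads a fixed snapshot, so running it again on the same inputs gives the same output.

```python
def join_with_dump(events, dump_rows, key="device_id"):
    """Left-join incoming events against a database dump loaded as plain rows."""
    # The dump is a snapshot: it does not change while the job runs.
    lookup = {row[key]: row for row in dump_rows}
    joined = []
    for event in events:
        # Events without a matching dump row are kept unchanged.
        match = lookup.get(event[key], {})
        joined.append({**match, **event})
    return joined


# Hypothetical data for illustration.
dump = [
    {"device_id": "a", "site": "berlin"},
    {"device_id": "b", "site": "oslo"},
]
events = [
    {"device_id": "a", "temp_c": 21},
    {"device_id": "c", "temp_c": 19},
]
enriched = join_with_dump(events, dump)

# With Spark DataFrames the same idea would be a left join, e.g.:
#   events_df.join(dump_df, on="device_id", how="left")
```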

Let me know if you have follow-up questions, or want assistance.

Regards,

Lars Albertsson
Data engineering consultant
www.mapflat.com
+46 70 7687109

