Posted to user@spark.apache.org by ni...@free.fr on 2015/07/25 16:14:08 UTC

Best practice for transforming and storing from Spark to Mongo/HDFS

Hello,
I am a new user of Spark and would like to know the best practice for the following scenario:

- Spark Streaming receives XML messages from Kafka
- Spark transforms each message of the RDD (XML-to-JSON conversion plus some enrichment)
- Spark stores the transformed/enriched messages in MongoDB and HDFS (using the Mongo key as the file name)

Basically, I assume I have to process the messages one by one inside a foreach over the RDD and write each one individually to MongoDB and HDFS, as sketched below.
Do you think this is the best way to do it?
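
In simplified code, I imagine something like this (just a sketch; xml2json, enrich, writeToMongo and writeToHdfs stand in for my own helper functions):

    // Naive version: handle every message individually.
    stream.foreachRDD { rdd =>
      rdd.foreach { xml =>
        val json = enrich(xml2json(xml))       // XML -> JSON + enrichment
        val key  = writeToMongo(json)          // one insert per message
        writeToHdfs(s"/data/$key.json", json)  // one HDFS file per message
      }
    }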

Tks
Nicolas


Re: Best practice for transforming and storing from Spark to Mongo/HDFS

Posted by Cody Koeninger <co...@koeninger.org>.
Use foreachPartition and batch the writes.
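
For example, something along these lines (an untested sketch: it assumes the messages were already converted to JSON strings upstream, uses the MongoDB Java driver 3.x, and the host, database, collection and path names are placeholders):

    import com.mongodb.MongoClient                 // mongo-java-driver 3.x (assumed)
    import org.bson.Document
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.collection.JavaConverters._

    jsonStream.foreachRDD { rdd =>
      rdd.foreachPartition { messages =>
        // One client per partition, not one per message.
        val mongo = new MongoClient("mongo-host")  // placeholder host
        val coll  = mongo.getDatabase("mydb").getCollection("events")
        val fs    = FileSystem.get(new Configuration())

        // Insert in fixed-size batches instead of one write per record;
        // the driver fills in the generated _id on each Document.
        messages.grouped(500).foreach { batch =>
          val docs = batch.map(m => Document.parse(m))
          coll.insertMany(docs.asJava)

          // One HDFS file per document, named by its Mongo _id.
          docs.foreach { doc =>
            val out = fs.create(new Path(s"/data/${doc.get("_id")}.json"))
            out.write(doc.toJson.getBytes("UTF-8"))
            out.close()
          }
        }
        mongo.close()
      }
    }

One caveat on the HDFS side: one small file per message does not scale well, since the NameNode has to track every file. If the volume is significant, consider rolling each batch into fewer, larger files (for example a SequenceFile keyed by the Mongo _id).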
