Posted to user@spark.apache.org by Natu Lauchande <nl...@gmail.com> on 2016/07/26 13:45:27 UTC

Question on set membership / diff sync technique in Spark

Hi,

I am working on a data pipeline in a Spark Streaming app that regularly
receives data as CSV.

After some enrichment we send the data to another storage layer (ES in this
case). Some of the records in the incoming CSV might be repeated.

I am trying to devise a strategy based on MD5 hashes of the lines to avoid
processing lines that have already been seen. I wonder what the best approach
to storing this hash data would be. I would prefer it to be stored in HDFS
within the same cluster.
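
To make the idea concrete, here is a minimal sketch in plain Python (not
Spark) of the MD5-based membership check I have in mind. The names line_md5
and filter_unseen, the sample rows, and the in-memory set are only
illustrative assumptions; in the real pipeline the set of seen hashes would
live in HDFS rather than in memory.

```python
import hashlib


def line_md5(line: str) -> str:
    """Return the hex MD5 digest of a CSV line, ignoring the line ending."""
    return hashlib.md5(line.rstrip("\r\n").encode("utf-8")).hexdigest()


def filter_unseen(lines, seen_hashes):
    """Return the lines whose MD5 is not yet in seen_hashes.

    seen_hashes is updated in place, so duplicates inside the same batch
    and duplicates across batches are both dropped.
    """
    new_lines = []
    for line in lines:
        digest = line_md5(line)
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            new_lines.append(line)
    return new_lines


# Hypothetical sample batches, standing in for two incoming CSV files.
seen = set()
batch1 = ["1,alice,10", "2,bob,20", "1,alice,10"]
batch2 = ["2,bob,20", "3,carol,30"]

first = filter_unseen(batch1, seen)
second = filter_unseen(batch2, seen)
```

In Spark I would expect the same logic to become a join between the hashes of
the incoming batch and the stored hashes, keeping only the unmatched rows, but
I have not settled on how to do that efficiently.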

I am considering a few formats:
- Parquet
- Sequence Files
- Avro
- Apache Arrow (does not seem to have a production-ready version yet)

Questions:

1. Is there any alternative approach to avoid re-processing the same rows?

2. Which data storage format or technique is best suited to this kind of set
membership operation?

Any help and thoughts are very much welcome.

Thanks in advance,
Natu