Posted to user@spark.apache.org by Natu Lauchande <nl...@gmail.com> on 2016/07/26 13:45:27 UTC
Question on set membership / diff sync technique in Spark
Hi,
I am working on a data pipeline in a Spark Streaming app that regularly
receives data as CSV.
After some enrichment we send the data to another storage layer (ES in this
case). Some of the records in the incoming CSV might be repeated.
I am trying to devise a strategy based on MD5 hashes of the lines to avoid
processing lines that have already been seen, and I wonder what the best
approach to storing this data would be. I would prefer the data to be
located in HDFS within the same cluster.
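To make the idea concrete, here is a rough sketch in plain Python (not Spark code) of the membership check I have in mind. The function names `line_md5` and `filter_unseen`, the in-memory `set`, and the sample rows are all made up for illustration; in the real pipeline I assume the set would be replaced by a join or lookup against hashes persisted in HDFS.

```python
import hashlib


def line_md5(line: str) -> str:
    """Return the hex MD5 digest of a CSV line, ignoring the trailing newline."""
    return hashlib.md5(line.rstrip("\r\n").encode("utf-8")).hexdigest()


def filter_unseen(lines, seen_hashes):
    """Yield only the lines whose MD5 is not yet in seen_hashes.

    seen_hashes is updated in place, so duplicates inside the same batch
    are dropped as well as duplicates from earlier batches.
    """
    for line in lines:
        digest = line_md5(line)
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            yield line


# Hypothetical sample data: two batches with repeated rows.
seen = set()
batch1 = ["1,alice,10", "2,bob,20", "1,alice,10"]
batch2 = ["2,bob,20", "3,carol,30"]

new1 = list(filter_unseen(batch1, seen))  # rows to enrich from batch 1
new2 = list(filter_unseen(batch2, seen))  # rows to enrich from batch 2
```

Only the rows in `new1` and `new2` would go on to enrichment and then to ES; the open question is how to keep the equivalent of `seen` durable and fast to query across batches.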
I am considering a few formats:
- Parquet
- Sequence Files
- Avro
- Apache Arrow (does not seem to have a production-ready version yet)
Questions:
1. Is there any alternative approach to avoid re-processing the same rows?
2. Which data storage format or technique is best suited to this kind of
set membership operation?
Any help and thoughts are very much welcome.
Thanks in advance,
Natu