Posted to dev@spark.apache.org by vonnagy <iv...@vadio.com> on 2015/10/18 04:23:06 UTC

Streaming and storing to Google Cloud Storage or S3

I have a streaming application that reads from Kafka (direct stream) and then
writes Parquet files. It is a pretty simple app that gets a Kafka direct
stream (8 partitions), calls `stream.foreachRDD`, and then stores to
Parquet using a DataFrame. Batch intervals are set to 10 seconds. During the
storage I use `partitionBy` so the data can be ordered by time and client.

When running the app and storing the data to a local FS, the performance is
somewhat acceptable (~1800 events in 3 seconds). If I point the destination
at Google Cloud Storage using the GCS connector, the same number of records
takes about 4 minutes. All other things are equal except for the file
destination.

Has anyone tried to go from streaming directly to GCS or S3 and overcome the
unacceptable performance? The stream can never keep up.

Thanks,

Ivan 



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Streaming-and-storing-to-Google-Cloud-Storage-or-S3-tp14664.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Streaming and storing to Google Cloud Storage or S3

Posted by Steve Loughran <st...@hortonworks.com>.
> On 18 Oct 2015, at 03:23, vonnagy <iv...@vadio.com> wrote:
> 
> Has anyone tried to go from streaming directly to GCS or S3 and overcome the
> unacceptable performance? The stream can never keep up.

The problem here is that these aren't really filesystems (certainly not S3 via the s3n and s3a clients): flush() is a no-op, and it's only on close() that the bulk upload happens. For bonus fun, anything that does a rename() usually forces a download/re-upload of the source files.
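To make that failure mode concrete, here is a toy, stdlib-only model of an object-store output stream (an illustration of the semantics described above, not the actual s3n/s3a code): bytes are buffered locally, flush() makes nothing durable, and the whole object only appears in the store when close() performs one bulk upload.

```python
import io

class FakeObjectStoreStream(io.BytesIO):
    """Toy model of an object-store 'file': writes are buffered locally,
    flush() is a no-op, and close() performs a single bulk upload."""

    def __init__(self, store, key):
        super().__init__()
        self._store = store  # dict standing in for the remote bucket
        self._key = key

    def flush(self):
        # Unlike a real filesystem, flushing makes nothing visible remotely.
        pass

    def close(self):
        # The entire object is uploaded in one shot on close().
        self._store[self._key] = self.getvalue()
        super().close()

bucket = {}
f = FakeObjectStoreStream(bucket, "events/part-00000.parquet")
f.write(b"row data")
f.flush()
print("events/part-00000.parquet" in bucket)  # False: flush changed nothing
f.close()
print(bucket["events/part-00000.parquet"])    # b'row data': visible only now
```

This is why a streaming job that emits many small files per batch pays a per-object upload on every close, and why rename-based output committers are so expensive against these stores.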
