Posted to user@spark.apache.org by Igor Berman <ig...@gmail.com> on 2016/07/06 09:27:21 UTC

streaming new data into bigger parquet file

Hi,
I was reading the following tutorial from the Databricks guide on
streaming data to S3:
https://docs.cloud.databricks.com/docs/latest/databricks_guide/07%20Spark%20Streaming/08%20Write%20Output%20To%20S3.html

It states that sometimes I need to compact the small files (e.g. from
Spark Streaming) into one bigger file. I understand why - better read
performance, avoiding the "many small files" problem, etc.
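
For context, this is roughly the compaction job I have in mind (just a
sketch - the bucket/paths are placeholders, using the sqlContext from
the shell):

  // read the many small parquet files produced by the streaming job
  val small = sqlContext.read.parquet("s3a://my-bucket/streaming-out/")

  // rewrite them as a handful of bigger files in the compacted location
  small.coalesce(4)
    .write
    .mode("overwrite")
    .parquet("s3a://my-bucket/compacted/")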

My questions are:
1. What happens when I have a big parquet file partitioned by some field
and I want to append new small files to it? Does Spark overwrite the
whole data set, or can it append the new data at the end? (A sketch of
the kind of write I mean is below.)
2. While the append is happening, how can I ensure that readers of the
big parquet files are not blocked and won't get any errors? (i.e. are
the files still "available" while new data is being appended to them?)

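To make question 1 concrete, the append I have in mind looks roughly
like this (again just a sketch - "date" and the paths are only
examples):

  // new small batch produced by the stream
  val newData = sqlContext.read.parquet("s3a://my-bucket/streaming-out/batch-123/")

  // append it into the existing partitioned parquet data
  newData.write
    .mode("append")          // SaveMode.Append
    .partitionBy("date")     // same column the existing data is partitioned by
    .parquet("s3a://my-bucket/compacted/")
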
I would highly appreciate any pointers.

thanks in advance,
Igor