Posted to dev@spark.apache.org by Sachin Aggarwal <di...@gmail.com> on 2016/06/21 10:19:42 UTC
Structured Streaming partition logic with respect to storage and fileformat
When we use readStream to read data as a stream, how does Spark decide the number
of RDDs, and the number of partitions within each RDD, with respect to the storage
and the file format?
val dsJson = sqlContext.readStream.json("/Users/sachin/testSpark/inputJson")
val dsCsv = sqlContext.readStream.option("header","true").csv(
"/Users/sachin/testSpark/inputCsv")
val ds = sqlContext.readStream.text("/Users/sachin/testSpark/inputText")
val dsText = ds.as[String].map(x => (x.split(" ")(0), x.split(" ")(1))).toDF("name","age")
val dsParquet =
sqlContext.readStream.format("parquet").parquet("/Users/sachin/testSpark/inputParquet")
--
Thanks & Regards
Sachin Aggarwal
7760502772
Re: Structured Streaming partition logic with respect to storage and fileformat
Posted by Sachin Aggarwal <di...@gmail.com>.
What will the scenario be in the case of S3 and the local file system?
On Tue, Jun 21, 2016 at 4:36 PM, Jörn Franke <jo...@gmail.com> wrote:
> It is based on the underlying Hadoop FileFormat, which splits the input mostly
> based on block size. You can change this, though.
>
> On 21 Jun 2016, at 12:19, Sachin Aggarwal <di...@gmail.com>
> wrote:
>
>
> when we use readStream to read data as Stream, how spark decides the no of
> RDD and partition within each RDD with respect to storage and file format.
>
> val dsJson = sqlContext.readStream.json(
> "/Users/sachin/testSpark/inputJson")
>
> val dsCsv = sqlContext.readStream.option("header","true").csv(
> "/Users/sachin/testSpark/inputCsv")
>
> val ds = sqlContext.readStream.text("/Users/sachin/testSpark/inputText")
> val dsText = ds.as[String].map(x =>(x.split(" ")(0),x.split(" ")(1))).toDF("name","age")
>
> val dsParquet = sqlContext.readStream.format("parquet").parquet("/Users/sachin/testSpark/inputParquet")
>
>
>
> --
>
> Thanks & Regards
>
> Sachin Aggarwal
> 7760502772
>
>
--
Thanks & Regards
Sachin Aggarwal
7760502772
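For S3 and the local file system there is no real HDFS block, so the Hadoop
FileSystem client just reports a configurable "block size" that the splitting
logic then uses. A hedged sketch of the relevant Hadoop properties (names are
standard Hadoop/S3A settings, but the defaults shown here may differ across
versions, so please verify against your distribution):

```properties
# Illustrative values only -- check your Hadoop version's defaults.
# S3A reports this value as the block size of every object (32 MB here):
fs.s3a.block.size=33554432
# Block size reported by the local file system client:
fs.local.block.size=33554432
```

Raising these values tends to produce fewer, larger input splits; lowering them
produces more, smaller ones.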
Re: Structured Streaming partition logic with respect to storage and fileformat
Posted by Jörn Franke <jo...@gmail.com>.
It is based on the underlying Hadoop FileFormat, which splits the input mostly based on block size. You can change this, though.
> On 21 Jun 2016, at 12:19, Sachin Aggarwal <di...@gmail.com> wrote:
>
>
> when we use readStream to read data as Stream, how spark decides the no of RDD and partition within each RDD with respect to storage and file format.
>
> val dsJson = sqlContext.readStream.json("/Users/sachin/testSpark/inputJson")
>
> val dsCsv = sqlContext.readStream.option("header","true").csv("/Users/sachin/testSpark/inputCsv")
> val ds = sqlContext.readStream.text("/Users/sachin/testSpark/inputText")
> val dsText = ds.as[String].map(x =>(x.split(" ")(0),x.split(" ")(1))).toDF("name","age")
>
> val dsParquet = sqlContext.readStream.format("parquet").parquet("/Users/sachin/testSpark/inputParquet")
>
>
> --
>
> Thanks & Regards
>
> Sachin Aggarwal
> 7760502772
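The blocksize-based splitting described above can be sketched numerically. The
following is an illustrative model only, not Spark's actual implementation
(which, in Spark 2.x, also honors settings such as
spark.sql.files.maxPartitionBytes): each file contributes roughly
ceil(fileLength / blockSize) splits, and each split becomes one partition.

```scala
// Illustrative sketch: FileInputFormat-style splitting cuts each file into
// splits of roughly blockSize bytes, so a rough estimate of the partition
// count is the per-file ceiling of length / blockSize, summed over files.
object SplitEstimate {
  def estimateSplits(fileLengths: Seq[Long], blockSize: Long): Long =
    fileLengths.map(len => math.max(1L, (len + blockSize - 1) / blockSize)).sum

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    // Two files of 200 MB and 10 MB with a 128 MB block size:
    // ceil(200/128) = 2 splits, plus 1 split for the small file -> 3 total.
    println(estimateSplits(Seq(200 * mb, 10 * mb), 128 * mb))
  }
}
```

With this model, increasing the block size directly reduces the number of
partitions, which matches the "you can change this" remark above.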