Posted to dev@spark.apache.org by Sachin Aggarwal <di...@gmail.com> on 2016/06/21 10:19:42 UTC

Structured Streaming partition logic with respect to storage and file format

When we use readStream to read data as a stream, how does Spark decide the
number of RDDs, and the number of partitions within each RDD, with respect to
the storage layer and the file format?

val dsJson = sqlContext.readStream.json("/Users/sachin/testSpark/inputJson")

val dsCsv = sqlContext.readStream.option("header","true").csv("/Users/sachin/testSpark/inputCsv")

val ds = sqlContext.readStream.text("/Users/sachin/testSpark/inputText")
val dsText = ds.as[String].map(x => (x.split(" ")(0), x.split(" ")(1))).toDF("name","age")

val dsParquet = sqlContext.readStream.parquet("/Users/sachin/testSpark/inputParquet")
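
As a rough way to see what Spark chose, here is a minimal sketch (assuming
Spark 2.x, and assuming that a static read of the same directory goes through
the same file-splitting logic, so its partition count is indicative):

val staticJson = sqlContext.read.json("/Users/sachin/testSpark/inputJson")
// number of partitions Spark decided on for this read
println(staticJson.rdd.getNumPartitions)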



-- 

Thanks & Regards

Sachin Aggarwal
7760502772

Re: Structured Streaming partition logic with respect to storage and file format

Posted by Sachin Aggarwal <di...@gmail.com>.
What would the scenario be in the case of S3 and the local file system?

On Tue, Jun 21, 2016 at 4:36 PM, Jörn Franke <jo...@gmail.com> wrote:

> It is based on the underlying Hadoop FileFormat, which splits mostly based
> on block size. You can change this, though.


-- 

Thanks & Regards

Sachin Aggarwal
7760502772

Re: Structured Streaming partition logic with respect to storage and file format

Posted by Jörn Franke <jo...@gmail.com>.
It is based on the underlying Hadoop FileFormat, which splits mostly based on block size. You can change this, though.
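
For example, a minimal sketch (assuming Spark 2.x; the config keys below are
standard, but defaults and exact splitting behaviour vary by version and data
source):

// File-based Dataset sources (json, csv, text, parquet): cap each read
// partition at roughly 64 MB instead of the usual 128 MB default.
sqlContext.setConf("spark.sql.files.maxPartitionBytes", (64 * 1024 * 1024).toString)

// Reads that go through a Hadoop InputFormat: set the split size directly.
sqlContext.sparkContext.hadoopConfiguration
  .set("mapreduce.input.fileinputformat.split.maxsize", (64 * 1024 * 1024).toString)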
