Posted to dev@carbondata.apache.org by xm_zzc <44...@qq.com> on 2018/01/16 17:38:25 UTC

Should CarbonData need to integrate with Spark Streaming too?

Hi dev:
  Currently CarbonData 1.3 (to be released soon) only supports integration
with Spark Structured Streaming, which requires Kafka version >= 0.10. I
think many users are still integrating Spark Streaming with Kafka 0.8; at
least our cluster is, and the cost of upgrading Kafka is too high. So
should CarbonData integrate with Spark Streaming too?
  
  I think there are two ways to integrate with Spark Streaming:
  1). CarbonData batch data loading + Auto compaction
  Use CarbonSession.createDataFrame to convert each RDD to a DataFrame in
InputDStream.foreachRDD, and then save the data into a CarbonData table that
supports auto compaction. This way, pre-aggregate tables can also be created
on this main table (a streaming table does not support creating
pre-aggregate tables on it).
  
  I can test this approach in our QA environment and add an example to
CarbonData.
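
  To illustrate why auto compaction matters for approach 1), here is a toy
model in plain Python (it is not Spark or CarbonData code). It assumes that
every micro-batch load creates one new segment and that compaction merges the
small segments once a count threshold is reached. ToyTable, merge_threshold,
and run are made-up names for illustration, and the real compaction policy may
differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyTable:
    """Toy stand-in for a table where every load creates one segment."""
    merge_threshold: int = 4  # hypothetical: merge once this many small segments exist
    small: List[int] = field(default_factory=list)   # row counts of unmerged segments
    merged: List[int] = field(default_factory=list)  # row counts of compacted segments

    def load(self, rows: List[dict]) -> None:
        """One micro-batch load adds one small segment, then tries compaction."""
        if not rows:  # skip empty micro-batches instead of creating empty segments
            return
        self.small.append(len(rows))
        if len(self.small) >= self.merge_threshold:
            self.merged.append(sum(self.small))  # merge all small segments into one
            self.small.clear()

    def segment_count(self) -> int:
        return len(self.small) + len(self.merged)


def run(batches, table):
    """Plays the role of foreachRDD: call table.load once per micro-batch."""
    for rows in batches:
        table.load(rows)
    return table


# Ten micro-batches of three rows each, merged four at a time.
table = run([[{"id": i}] * 3 for i in range(10)], ToyTable(merge_threshold=4))
print(table.merged, table.small, table.segment_count())  # [12, 12] [3, 3] 4
```

  Without the merge step the same ten loads would leave ten small segments,
which is the situation auto compaction is meant to avoid.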
  
  2). The same as integration with Structured Streaming
  In this approach, Structured Streaming appends the data of every mini-batch
to a stream segment, which is in row format. When the size of the stream
segment exceeds 'carbon.streaming.segment.max.size', the stream segment is
automatically converted to a batch segment (columnar format) at the beginning
of a batch, and a new stream segment is created for appending data.
  However, I have no idea yet how to integrate this with Spark Streaming. *Any
suggestions*?
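
  The hand-off behaviour described in 2) can be sketched as a toy model in
plain Python. This is only an illustration under stated assumptions, not
CarbonData code: ToyStreamingTable, add_batch, and _hand_off_if_full are
invented names, and max_stream_rows counts rows as a stand-in for the
byte-based 'carbon.streaming.segment.max.size'.

```python
from typing import List


class ToyStreamingTable:
    """Toy model: a row-format stream segment that is handed off when too large."""

    def __init__(self, max_stream_rows: int) -> None:
        # Stand-in for 'carbon.streaming.segment.max.size' (rows here, not bytes).
        self.max_stream_rows = max_stream_rows
        self.stream_segment: List[tuple] = []        # open row-format segment
        self.batch_segments: List[List[list]] = []   # finished columnar segments

    def _hand_off_if_full(self) -> None:
        """Convert the stream segment to columnar form once it exceeds the limit."""
        if len(self.stream_segment) > self.max_stream_rows:
            columns = [list(col) for col in zip(*self.stream_segment)]  # rows -> columns
            self.batch_segments.append(columns)
            self.stream_segment = []                 # start a new stream segment

    def add_batch(self, rows: List[tuple]) -> None:
        """At the beginning of a batch, check for hand-off, then append the rows."""
        self._hand_off_if_full()
        self.stream_segment.extend(rows)


table = ToyStreamingTable(max_stream_rows=3)
table.add_batch([(1, "a"), (2, "b")])
table.add_batch([(3, "c"), (4, "d")])   # segment now holds 4 rows, over the limit
table.add_batch([(5, "e")])             # hand-off happens before this append
print(table.batch_segments)             # [[[1, 2, 3, 4], ['a', 'b', 'c', 'd']]]
print(table.stream_segment)             # [(5, 'e')]
```

  Because the size check runs only at the beginning of a batch, the stream
segment can exceed the limit by up to one mini-batch before it is converted.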



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: Should CarbonData need to integrate with Spark Streaming too?

Posted by xm_zzc <44...@qq.com>.

Liang Chen wrote
> Hi
> 
> Thanks for starting this discussion about adding Spark Streaming support.
> 1. Please try to reuse the current code (Structured Streaming) rather than
> adding separate logic for Spark Streaming.

[reply] The original idea is to reuse the current code (Structured Streaming)
to implement the integration with Spark Streaming.


Liang Chen wrote
> 2. I suggest using Structured Streaming by default; please consider
> how to make a configuration for enabling/switching to Spark Streaming.

[reply] The implementations of Structured Streaming and Spark Streaming are
different, and so is their usage. I don't understand what 'consider how to
make configuration for enabling/switching to spark streaming' means.
IMO, we just need to implement a utility that writes RDD data to the streaming
segment in DStream.foreachRDD, with the same logic as
CarbonAppendableStreamSink.addBatch. Right?


Re: Should CarbonData need to integrate with Spark Streaming too?

Posted by Liang Chen <ch...@gmail.com>.

Hi

Thanks for starting this discussion about adding Spark Streaming support.
1. Please try to reuse the current code (Structured Streaming) rather than
adding separate logic for Spark Streaming.
2. I suggest using Structured Streaming by default; please consider
how to make a configuration for enabling/switching to Spark Streaming.

Regards
Liang


Re: Should CarbonData need to integrate with Spark Streaming too?

Posted by xm_zzc <44...@qq.com>.

Hi Jacky:
>>  1). CarbonData batch data loading + Auto compaction
>>  Use CarbonSession.createDataFrame to convert each RDD to a DataFrame in
>> InputDStream.foreachRDD, and then save the data into a CarbonData table
>> that supports auto compaction. This way, pre-aggregate tables can also be
>> created on this main table (a streaming table does not support creating
>> pre-aggregate tables on it).
>>
>>  I can test this approach in our QA environment and add an example to
>> CarbonData.
>
> This approach is doable, but the loading interval should be relatively
> long, since this approach still uses columnar files. How frequently do you
> run a batch load?

Agreed. The loading interval should be relatively long, maybe 15s, 30s, or
even 1min, but it also depends on the data size of each mini-batch.
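
As a rough back-of-the-envelope check on these intervals, the small Python
sketch below counts how many batch loads one hour produces and how many rows
each load carries. The input rate of 10,000 rows per second is an invented
figure for illustration, not a measurement from any cluster, and the function
names are made up.

```python
def loads_per_hour(interval_seconds: int) -> int:
    """Number of batch loads in one hour for a given micro-batch interval."""
    return 3600 // interval_seconds


def rows_per_load(rows_per_second: int, interval_seconds: int) -> int:
    """Rows collected in one micro-batch at a steady input rate."""
    return rows_per_second * interval_seconds


# Hypothetical steady input rate of 10,000 rows per second.
for interval in (15, 30, 60):
    print(interval, loads_per_hour(interval), rows_per_load(10_000, interval))
# 15 240 150000
# 30 120 300000
# 60 60 600000
```

A longer interval means fewer but larger loads per hour, which is the
trade-off between load frequency and mini-batch size mentioned above.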

>>  2). The same as integration with Structured Streaming
>>  In this approach, Structured Streaming appends the data of every
>> mini-batch to a stream segment, which is in row format. When the size of
>> the stream segment exceeds 'carbon.streaming.segment.max.size', the stream
>> segment is automatically converted to a batch segment (columnar format) at
>> the beginning of a batch, and a new stream segment is created for
>> appending data.
>>  However, I have no idea yet how to integrate this with Spark Streaming.
>> *Any suggestions*?
>
> You can refer to the logic in CarbonAppendableStreamSink.addBatch.
> Basically, it launches a job that appends to the row-format files in the
> streaming segment by invoking CarbonAppendableStreamSink.writeDataFileJob.
> At the beginning, you can invoke checkOrHandOffSegment to create the
> streaming segment.
> I think integrating with Spark Streaming is a good feature to have; it
> enables more users to use the carbon streaming ingest feature on existing
> clusters with old Spark and Kafka versions.
> Please feel free to create a JIRA ticket and discuss it in the community.

OK, I have read the code of the streaming module and discussed it with David
offline. I will implement this feature ASAP.




Re: Should CarbonData need to integrate with Spark Streaming too?

Posted by Jacky Li <ja...@qq.com>.


> On Jan 17, 2018, at 1:38 AM, xm_zzc <44...@qq.com> wrote:
> 
> Hi dev:
>  Currently CarbonData 1.3 (to be released soon) only supports integration
> with Spark Structured Streaming, which requires Kafka version >= 0.10. I
> think many users are still integrating Spark Streaming with Kafka 0.8; at
> least our cluster is, and the cost of upgrading Kafka is too high. So
> should CarbonData integrate with Spark Streaming too?
> 
>  I think there are two ways to integrate with Spark Streaming:
>  1). CarbonData batch data loading + Auto compaction
>  Use CarbonSession.createDataFrame to convert each RDD to a DataFrame in
> InputDStream.foreachRDD, and then save the data into a CarbonData table
> that supports auto compaction. This way, pre-aggregate tables can also be
> created on this main table (a streaming table does not support creating
> pre-aggregate tables on it).
> 
>  I can test this approach in our QA environment and add an example to
> CarbonData.

This approach is doable, but the loading interval should be relatively long, since this approach still uses columnar files. How frequently do you run a batch load?

> 
>  2). The same as integration with Structured Streaming
>  In this approach, Structured Streaming appends the data of every mini-batch
> to a stream segment, which is in row format. When the size of the stream
> segment exceeds 'carbon.streaming.segment.max.size', the stream segment is
> automatically converted to a batch segment (columnar format) at the
> beginning of a batch, and a new stream segment is created for appending data.
>  However, I have no idea yet how to integrate this with Spark Streaming.
> *Any suggestions*?
> 

You can refer to the logic in CarbonAppendableStreamSink.addBatch. Basically, it launches a job that appends to the row-format files in the streaming segment by invoking CarbonAppendableStreamSink.writeDataFileJob. At the beginning, you can invoke checkOrHandOffSegment to create the streaming segment.
I think integrating with Spark Streaming is a good feature to have; it enables more users to use the carbon streaming ingest feature on existing clusters with old Spark and Kafka versions.
Please feel free to create a JIRA ticket and discuss it in the community.

Re: Should CarbonData need to integrate with Spark Streaming too?

Posted by Kumar Vishal <ku...@gmail.com>.

+1
Regards
Kumar Vishal

On Thu, Jan 18, 2018 at 12:13 PM, jarray <ja...@163.com> wrote:

> +1, should do it
>
> On 01/18/2018 12:04, David CaiQiang wrote:
> +1  for 2). The same as integration with Structured Streaming
>
> -----
> Best Regards
> David Cai

Re: Should CarbonData need to integrate with Spark Streaming too?

Posted by jarray <ja...@163.com>.

+1, should do it

On 01/18/2018 12:04, David CaiQiang wrote:
+1  for 2). The same as integration with Structured Streaming



-----
Best Regards
David Cai

Re: Should CarbonData need to integrate with Spark Streaming too?

Posted by David CaiQiang <da...@gmail.com>.

+1  for 2). The same as integration with Structured Streaming 



-----
Best Regards
David Cai