Posted to issues@spark.apache.org by "zhaP524 (JIRA)" <ji...@apache.org> on 2017/09/08 00:28:00 UTC

[jira] [Updated] (SPARK-21948) How to use Spark Streaming to process two tables from one Kafka topic?

     [ https://issues.apache.org/jira/browse/SPARK-21948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhaP524 updated SPARK-21948:
----------------------------
    Attachment: QQ图片20170908080946.png

> How to use Spark Streaming to process two tables from one Kafka topic?
> -----------------------------------------------------------------------
>
>                 Key: SPARK-21948
>                 URL: https://issues.apache.org/jira/browse/SPARK-21948
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Spark Submit, Structured Streaming
>    Affects Versions: 2.1.1
>         Environment: kafka:0.10.0
> Spark:2.1.1
>            Reporter: zhaP524
>         Attachments: QQ图片20170908080946.png
>
>
> Now I have the following requirement: from one Kafka topic, with one consumer group, I receive a DirectStream that contains the data of two tables (A <master> and B <slave>). My ultimate goal is to join the data of these two tables on certain fields and write the results into a database.
> My previous attempts were as follows:
> 1. Process each batch directly at the DirectStream level: filter out the A and B records, build a DataFrame for each, join them with Spark SQL, and write the result to the database (a sketch of this approach follows). But this approach has a problem: A and B have a 1:N relationship, so within one batch all of the A data may have arrived while only part of the B data has, and the computed result is therefore only partial. When the remaining B records arrive in the next batch, the matching A data has already been consumed, so those B records have nothing to join against and the data is lost. I wonder whether I could buffer the A-table data in a queue or similar until all of the corresponding B-table data has been loaded, and only then produce the complete result; in essence this is similar to a Spark Streaming window operation.
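> For reference, a minimal sketch of this per-batch filter-and-join approach (not the reporter's original code). It assumes each Kafka record value is a JSON string carrying a "table" field that tags the row as A or B, and an "id" join key; the broker address, group id, and field names are illustrative assumptions:
> {code:scala}
> import org.apache.kafka.common.serialization.StringDeserializer
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.streaming.{Seconds, StreamingContext}
> import org.apache.spark.streaming.kafka010.KafkaUtils
> import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
> import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
> 
> val spark = SparkSession.builder.appName("TwoTablesOneTopic").getOrCreate()
> val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
> 
> val kafkaParams = Map[String, Object](
>   "bootstrap.servers" -> "broker:9092",          // assumption
>   "key.deserializer" -> classOf[StringDeserializer],
>   "value.deserializer" -> classOf[StringDeserializer],
>   "group.id" -> "two-table-group",               // assumption
>   "auto.offset.reset" -> "latest"
> )
> 
> val stream = KafkaUtils.createDirectStream[String, String](
>   ssc, PreferConsistent, Subscribe[String, String](Seq("doc"), kafkaParams))
> 
> // Extract the plain String value right away; ConsumerRecord itself
> // is not serializable and must not cross a shuffle.
> val values = stream.map(_.value)
> 
> values.foreachRDD { rdd =>
>   if (!rdd.isEmpty) {
>     val df = spark.read.json(rdd)              // values parsed as JSON (assumed)
>     val a  = df.filter(df("table") === "A")    // master rows
>     val b  = df.filter(df("table") === "B")    // slave rows
>     val joined = a.join(b, Seq("id"))          // join key "id" is an assumption
>     // joined.write.jdbc(...)                  // write results to the database
>   }
> }
> 
> ssc.start()
> ssc.awaitTermination()
> {code}
> As described above, this only joins rows that arrive in the same batch; B rows that arrive in a later batch than their A rows are lost.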
> 2. Use a Spark Streaming window operation: I can separate the A and B table data from the DirectStream, but the join has to be carried out inside the window, and there operations such as transform and map fail with a serialization error, so the process cannot go through (a possible workaround is sketched after the stack trace):
> java.io.NotSerializableException: org.apache.kafka.clients.consumer.ConsumerRecord
> Serialization stack:
> 	- object not serializable (class: org.apache.kafka.clients.consumer.ConsumerRecord, value: ConsumerRecord(topic = doc, partition = 0,...), 
> I don't know what I should do.
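> The exception above typically occurs when the raw ConsumerRecord objects are windowed or shuffled, since ConsumerRecord is not serializable. A minimal sketch of a possible workaround, reusing `spark` and `stream` from the sketch above: map to plain String values before any window, then join inside the window so that A rows wait long enough for their late B rows. The window and slide durations are illustrative assumptions:
> {code:scala}
> import org.apache.spark.streaming.Seconds
> 
> // Keep only the serializable String value; never window ConsumerRecord.
> val rows = stream.map(_.value)
> 
> // Window sized (assumption) so late B rows still overlap their A rows.
> val windowed = rows.window(Seconds(60), Seconds(10))
> 
> windowed.foreachRDD { rdd =>
>   if (!rdd.isEmpty) {
>     val df = spark.read.json(rdd)            // same JSON/tagging assumption
>     val a  = df.filter(df("table") === "A")
>     val b  = df.filter(df("table") === "B")
>     val joined = a.join(b, Seq("id"))        // join key "id" is an assumption
>     // joined.write.jdbc(...)                // results can repeat across
>     //                                       // overlapping windows; dedupe on write
>   }
> }
> {code}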
> So I was wondering: is there a problem with my scenario, or is this a technical problem on my side? Is there a solution for my business scenario that does not introduce other components? Please advise. Thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org