Posted to dev@s2graph.apache.org by "Chul Kang (JIRA)" <ji...@apache.org> on 2018/03/22 01:12:00 UTC
[jira] [Updated] (S2GRAPH-185) Support Spark Structured Streaming to work with data in streaming and batch
[ https://issues.apache.org/jira/browse/S2GRAPH-185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chul Kang updated S2GRAPH-185:
------------------------------
Description:
By default, S2Graph publishes all edge/vertex requests to Kafka in WAL format.
At Kakao, S2Graph has been used as a master database that stores all user activities.
I have been developing several ETL jobs that suit these use cases, and I want to contribute them.
The use cases are as follows:
{code}
save edges/vertices incoming through Kafka to other storages
- Druid sink for slice-and-dice analytics
- Elasticsearch sink for search
- file sink for storing edges/vertices
ingest from various storages into S2Graph
- MySQL binlog
- HDFS/Hive/HBase
ETL jobs on edge/vertex data
- merge all user activities by userId
- generate statistical information
- apply ML libraries to graph-formatted data
{code}
Below are some basic requirements for this:
* supports processing both streaming and static sources
* the computation flow is reusable and shared between streaming and batch
* operates from a simple job description
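To make the "simple job description" requirement concrete, a job could be declared as data rather than code. The shape below is purely illustrative (the field names, the "sql" process type, and all option keys are hypothetical, not a committed format):
{code}
{
  "name": "wal_to_es",
  "source": {
    "type": "kafka",
    "options": {
      "kafka.bootstrap.servers": "broker:9092",
      "subscribe": "s2graph-wal"
    }
  },
  "process": [
    { "type": "sql", "sql": "SELECT timestamp, operation, element FROM _input" }
  ],
  "sink": {
    "type": "es",
    "options": { "es.nodes": "es-host:9200" }
  }
}
{code}
A description like this could drive either a streaming or a batch run of the same job, which is exactly the reuse the requirements above ask for.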
Spark Structured Streaming provides a unified API for streaming and batch through the DataFrame/Dataset API from Spark SQL.
It allows the same operations to run on bounded and unbounded data sources and guarantees exactly-once fault tolerance.
Structured Streaming ships with several built-in data sources and sinks, and custom ones can be added by implementing the Source/Sink interfaces.
Using these, we can easily develop ETL jobs that connect to various storages.
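As a rough sketch of that unified API (not the proposed implementation — the topic name, paths, and the assumption that a WAL line is tab-separated into timestamp/operation/element are all placeholders):
{code}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object WalEtlSketch {
  // One transformation, shared by batch and streaming: parse a WAL line
  // (assumed tab-separated here) into typed columns.
  def transform(wal: DataFrame): DataFrame =
    wal.selectExpr("CAST(value AS STRING) AS line")
      .withColumn("parts", split(col("line"), "\t"))
      .select(
        col("parts").getItem(0).as("timestamp"),
        col("parts").getItem(1).as("operation"),
        col("parts").getItem(2).as("element"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("wal-etl").getOrCreate()

    // Batch: the transform applied to a bounded WAL dump on disk.
    transform(spark.read.format("text").load("/path/to/wal"))
      .write.format("parquet").save("/path/to/out")

    // Streaming: the identical transform on an unbounded Kafka source.
    transform(
        spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "s2graph-wal")
          .load())
      .writeStream.format("parquet")
      .option("checkpointLocation", "/path/to/checkpoint")
      .option("path", "/path/to/out")
      .start()
      .awaitTermination()
  }
}
{code}
The point of the sketch is that only the read/write edges differ; the computation flow in transform is written once and shared, and the checkpoint location gives the streaming run its exactly-once guarantee.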
Reference: [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html]
> Support Spark Structured Streaming to work with data in streaming and batch
> ---------------------------------------------------------------------------
>
> Key: S2GRAPH-185
> URL: https://issues.apache.org/jira/browse/S2GRAPH-185
> Project: S2Graph
> Issue Type: New Feature
> Reporter: Chul Kang
> Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)