Posted to dev@s2graph.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/12/31 05:17:00 UTC

[jira] [Commented] (S2GRAPH-252) Improve performance of S2GraphSource

    [ https://issues.apache.org/jira/browse/S2GRAPH-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731175#comment-16731175 ] 

ASF GitHub Bot commented on S2GRAPH-252:
----------------------------------------

GitHub user SteamShon opened a pull request:

    https://github.com/apache/incubator-s2graph/pull/195

    [S2GRAPH-252]: Improve performance of S2GraphSource

    - add SchemaManager.
    - add SerializeUtil/DeserializeUtil .
    - refactor S2GraphSink/S2GraphSource to use SerializeUtil/DeserializeUtil.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/SteamShon/incubator-s2graph S2GRAPH-252

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-s2graph/pull/195.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #195
    
----
commit 7b2fd3576a88c0ee5a1c83a39fe451960bebcab9
Author: DO YUNG YOON <st...@...>
Date:   2018-12-25T11:39:55Z

    add HFileParserUDF.

commit 23793d47f1102fc23f7a07f4b2ac53d45e45e0ef
Author: DO YUNG YOON <st...@...>
Date:   2018-12-26T02:00:30Z

    add LabelSchema.

commit b7e58f6dcee79c634126f7f9cf60caf42719832c
Author: DO YUNG YOON <st...@...>
Date:   2018-12-26T02:29:50Z

    change S2GraphSource to use DeserializeUtil directly on Result.

commit dfa76a9d0d0d5d932c7a2a9bcfa43c77485adaae
Author: DO YUNG YOON <st...@...>
Date:   2018-12-26T03:56:47Z

    add error handling.

commit 33388de1732beaea8375dee8db74a2c4f619603e
Author: DO YUNG YOON <st...@...>
Date:   2018-12-27T04:55:36Z

    directly deserialize cell.

commit 2942e42eb00e5f9fa19ecf23319d615df3a2e87a
Author: DO YUNG YOON <st...@...>
Date:   2018-12-29T00:47:41Z

    tmp.

commit 952cdf68480eb2e1c1a6292102555ea5bdee7d46
Author: DO YUNG YOON <st...@...>
Date:   2018-12-29T11:31:07Z

    add DeserializeSchema/SerializeSchema.

commit c750774202390f00a58aa279b7cfe2245248699f
Author: DO YUNG YOON <st...@...>
Date:   2018-12-30T05:59:47Z

    merge DeserializeSchema and SerializeSchema to SchemaManager.

commit 420c0bce3011fc89d4ad08a58166470681d308d7
Author: DO YUNG YOON <st...@...>
Date:   2018-12-31T03:09:23Z

    bug fix on wide/tall schema on Vertex/SerializeUtil.

commit 48b1594417f476256383811405bae8728f3e1780
Author: DO YUNG YOON <st...@...>
Date:   2018-12-31T03:23:52Z

    Need to pass right spark sql schema on createDataFrame.

commit bf5cfe61ba714cf5266183d48074a3a954c76536
Author: DO YUNG YOON <st...@...>
Date:   2018-12-31T03:35:41Z

    support vertex in S2GraphSource.

commit b8ecf23908ea10b264f76ebb257a5efe084a93d5
Author: DO YUNG YOON <st...@...>
Date:   2018-12-31T05:13:35Z

    refactor S2GraphSink bulkload to use SchemaManager to build RDD[KeyValue].

----


> Improve performance of S2GraphSource 
> -------------------------------------
>
>                 Key: S2GRAPH-252
>                 URL: https://issues.apache.org/jira/browse/S2GRAPH-252
>             Project: S2Graph
>          Issue Type: Improvement
>          Components: s2jobs
>            Reporter: DOYUNG YOON
>            Assignee: DOYUNG YOON
>            Priority: Major
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> S2GraphSource is responsible for translating an HBase snapshot (*TableSnapshotInputFormat*) into graph elements such as edges and vertices.
> The code below creates an *RDD[(ImmutableBytesWritable, Result)]* from *TableSnapshotInputFormat*:
> {noformat}
> val rdd = ss.sparkContext.newAPIHadoopRDD(job.getConfiguration,
>         classOf[TableSnapshotInputFormat],
>         classOf[ImmutableBytesWritable],
>         classOf[Result])
> {noformat}
> The problem comes after obtaining the RDD. 
> The current implementation uses *RDD.mapPartitions* because the S2Graph class is not serializable, mostly because it holds an Asynchbase client.
> The problematic part is the following:
> {noformat}
> val elements = input.mapPartitions { iter =>
>       val s2 = S2GraphHelper.getS2Graph(config)
>       iter.flatMap { line =>
>         reader.read(s2)(line)
>       }
>     }
>     val kvs = elements.mapPartitions { iter =>
>       val s2 = S2GraphHelper.getS2Graph(config)
>       iter.map(writer.write(s2)(_))
>     }
> {noformat}
> On each RDD partition, the S2Graph instance connects to the meta storage, such as MySQL, and uses a local cache to avoid heavy reads from it.
> Even though this works for datasets with few partitions, the scalability of S2GraphSource is limited by the number of partitions, which has to grow when dealing with large data.
> A possible improvement is to avoid depending on the meta storage when deserializing HBase's Result class into Edge/Vertex. 
> We can achieve this by loading all the necessary schemas from the meta storage on the Spark driver, then broadcasting those schemas and using them for deserialization instead of connecting to the meta storage on each partition, as sketched below.
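> A minimal sketch of that broadcast-based approach follows; *SchemaManager.loadAllSchemas* and *DeserializeUtil.resultToEdges* are hypothetical method names used only to illustrate the idea, and *rdd* is the snapshot RDD created above.
> {noformat}
> // Sketch only: assumes the loaded schemas form a serializable snapshot that
> // DeserializeUtil can use without a live S2Graph instance.
>
> // 1. Load every schema needed for deserialization once, on the Spark driver.
> val schemas = SchemaManager.loadAllSchemas(config)
>
> // 2. Broadcast the schema snapshot to all executors.
> val schemaBc = ss.sparkContext.broadcast(schemas)
>
> // 3. Deserialize HBase Results per partition without touching the meta storage.
> val elements = rdd.mapPartitions { iter =>
>   val localSchemas = schemaBc.value
>   iter.flatMap { case (_, result) =>
>     // no S2Graph instance and no RDBMS connection needed here
>     DeserializeUtil.resultToEdges(localSchemas, result)
>   }
> }
> {noformat}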



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)