Posted to users@s2graph.apache.org by "Ivan, Chen Penghe" <pe...@adsc.com.sg> on 2016/06/27 06:58:46 UTC

Inquiry -- S2Graph vs Titan

Dear Madam/Sir,

This is Ivan, from Singapore.

I am quite interested in using S2Graph to store graph data in our
project, so I want to check with you, from the distributed perspective,
how the mechanisms of S2Graph differ from Titan's. From the FAQ
page, I can see the difference between S2Graph and Titan in terms of
performance. However, I did not find information regarding how
S2Graph handles graph partitioning to realize this distributed
management.

Many thanks!

Regards,
Ivan

Re: Inquiry -- S2Graph vs Titan

Posted by DO YUNG YOON <sh...@gmail.com>.
Hi Ivan.

Thanks for sharing details and questions.

First, let's talk about "Distributed".

S2Graph is very similar to Titan in its distributed nature (I used to be a
user of Titan before I started S2Graph, so many parts should be similar).

In more detail, Titan can choose its storage layer, mostly Cassandra or
HBase. S2Graph also supports this feature, although currently the only
available option is HBase.

Choosing HBase as the storage in S2Graph makes it act as a graph layer atop
HBase.
It provides a layer that accepts logical vertices/edges and stores them into
HBase using random partitioning. As long as your data fits into your HBase
cluster, you don't have to worry about scalability for data size.

About splitting data, S2Graph asks the user how many pre-splits the HBase
table should have when you define a "Label", which is the schema for edges.

Let's say you defined your pre-split as 2. This means your HBase
table (where your vertices and edges will be stored) has 2 partitions, and
each partition is logically responsible for Int.MaxValue / 2 of the key
space. This 2 is only the minimum number of partitions you start with; once
more data comes in, HBase automatically splits partitions (by default it
splits any partition that becomes too big into two, and the pre-split range
is used as the first level of this hierarchy).

Ex) pre-split = 2. HBase has the following byte-range (startKey ~ stopKey)
partitions.

\x19\x99\x99\x99 3332 : partition 0
3332 L\xCC\xCC\xCB : partition 1

Let's say a user creates a (user_id_1 -> created -> tweet_id_999) edge.
Then S2Graph decides the right partition based on the value of
Bytes.toBytes(MurmurHash("user_id_1").toInt).
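The routing idea can be sketched roughly as follows. This is an illustrative toy, not S2Graph's actual code: `zlib.crc32` stands in for MurmurHash, and the boundary computation simply carves [0, Int.MaxValue] into equal ranges.

```python
# Sketch: route a row to a pre-split region by hashing the source vertex id,
# turning the hash into big-endian bytes, and finding the region whose
# [start, stop) range contains that byte prefix.
import bisect
import struct
import zlib

INT_MAX = 2**31 - 1

def split_points(pre_split: int) -> list[bytes]:
    """Boundary keys for `pre_split` regions over [0, Int.MaxValue]."""
    step = INT_MAX // pre_split
    return [struct.pack(">i", step * i) for i in range(1, pre_split)]

def partition_for(vertex_id: str, boundaries: list[bytes]) -> int:
    h = zlib.crc32(vertex_id.encode()) & 0x7FFFFFFF  # non-negative 31-bit hash
    row_prefix = struct.pack(">i", h)
    return bisect.bisect_right(boundaries, row_prefix)

boundaries = split_points(2)            # one boundary key -> two regions
p = partition_for("user_id_1", boundaries)
print(p)                                # 0 or 1, stable for the same vertex id
```

The point is that the partition depends only on the source vertex id, so a vertex and its adjacent edges always land in the same region.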

When more data goes into partition 0, HBase splits partition 0 as follows.

\x19\x99\x99\x99 1132\xdf : partition 0
1132\xdf  3332 : partition 0-1
3332 L\xCC\xCC\xCB : partition 1

Even as data grows, S2Graph itself behaves the same; HBase takes
responsibility for managing these partitions.

Simply put, S2Graph tries to spread different source vertices and their
adjacent edges as evenly as possible across partitions.

I believe this is similar to what Titan does with HBase Storage.

Since all data is stored in HBase, S2Graph behaves as a thin layer atop HBase.

About indexing performance, please refer to the "Performance" section of
http://schd.ws/hosted_files/apachebigdata2016/03/S2Graph-%20Apache%20Big%20Data.pdf


Indexing performance is mainly dominated by the pre-split size and your HBase
cluster setup.

One nice thing about S2Graph is that the S2Graph server is stateless. It
stores all necessary data and locks (only when contention happens) in HBase,
so if one S2Graph server is not performant enough, just add more.
Write/query throughput on S2Graph's side is linearly scalable.


Secondly, let's talk about query types.

You are correct that S2Graph is mainly targeted at vertex-based
queries (OLTP).

Finding all edges with a specific type requires a different type of
operation (OLAP; it requires a full scan).

I believe you can run not only OLTP-type but also OLAP-type queries in Titan.
In production, a full scan on a large graph becomes problematic. S2Graph does
not support OLAP-type queries itself, since it aims to be a production-ready
OLTP graph database.

Here is how I have been using S2Graph for OLAP-type queries.

Since S2Graph publishes all incoming data into Apache Kafka, it is easy to
store everything into Hive tables on HDFS. S2Graph provides a "loader"
project that does exactly this: the loader asks the user which Kafka topics
to consume and loads all data from those topics into a Hive table.

We run OLAP queries using Hive and Spark SQL to find the right sets.
We even run Spark's MLlib to build datasets inferred from long-term
history.

Then we can bulk-load these discovered, useful relations into S2Graph
using the loader project.

OLTP clients can consume this without any changes.

We also use Elasticsearch to find starting vertices that meet certain
criteria, then run vertex-centric queries on S2Graph to find the related
graph.

In the dev phase, it would also be possible to index as follows: store
(user_id_1 -> created -> tweet_id_999, "props": {"created_at": "20160630"}),
then explicitly create the following auxiliary edge:
(created -> has -> toString(user_id_1, tweet_id_999)),
and then query "what are all edges with label name created".
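The auxiliary-edge trick can be sketched with an in-memory toy model (not S2Graph's API): alongside each real edge, store a second edge from the label name itself to a composite vertex, so "all edges with label X" becomes a plain vertex-centric lookup on X.

```python
# Toy model: adjacency maps a vertex id to its outgoing (label, target) pairs.
from collections import defaultdict

adjacency = defaultdict(list)

def insert_edge(src: str, label: str, tgt: str) -> None:
    adjacency[src].append((label, tgt))
    # auxiliary edge: label-name vertex -> "has" -> composite id
    adjacency[label].append(("has", f"{src}_{tgt}"))

insert_edge("user_id_1", "created", "tweet_id_999")
insert_edge("user_id_2", "created", "tweet_id_123")

# "what are all edges with label name created" as a vertex-centric query:
created = [tgt for lbl, tgt in adjacency["created"] if lbl == "has"]
print(created)  # ['user_id_1_tweet_id_999', 'user_id_2_tweet_id_123']
```

The cost of this design is the extra write per edge, and the "created" vertex itself becomes high-degree, which is why it only suits the dev phase.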

Simply put, S2Graph avoids running OLAP queries on the same HBase cluster
that accepts lots of concurrent OLTP requests. It tries to take advantage of
other, more mature projects. I know this requires many more components, but
I believe they are absolutely worth trying in production.

S2Graph could automate the above process in the future, and I think it should
be on our roadmap, but for now users have to set up all of the above
components themselves.

Third, about vertex property values.

Like you said, storing some properties on another system limits
searching/traversing the graph based on those properties.

S2Graph currently only supports String, Long, and Double types for property
values, with "In", "Not In", "=", "!=", ">=", ">", "<=", "<" operators.
We do not currently support string operations like "include" or "like" yet,
but I think they would be fairly straightforward to add.

So my suggestion would be to keep property values that will be heavily used
in traversal inside the graph.

I guess it depends on what type your traversal would be. If the operation on
a property value during traversal is simple, then keeping it inside the
graph is much, much faster than looking it up in another system.
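As a rough illustration of filtering edges on property values during a traversal (the shapes and names here are assumptions for the sketch, not S2Graph's actual API), the operators above map naturally onto comparisons applied per edge:

```python
# Sketch: apply one of the supported comparison operators to an edge property
# while scanning a vertex's adjacency list.
import operator

OPS = {
    "=": operator.eq, "!=": operator.ne,
    ">=": operator.ge, ">": operator.gt,
    "<=": operator.le, "<": operator.lt,
    "in": lambda v, s: v in s, "not in": lambda v, s: v not in s,
}

def filter_edges(edges, prop, op, value):
    return [e for e in edges
            if prop in e["props"] and OPS[op](e["props"][prop], value)]

edges = [
    {"to": "tweet_id_999", "props": {"created_at": 20160630}},
    {"to": "tweet_id_123", "props": {"created_at": 20160101}},
]
recent = filter_edges(edges, "created_at", ">=", 20160630)
print([e["to"] for e in recent])  # ['tweet_id_999']
```

Because the filter runs on data already colocated with the edge, there is no extra round trip to a second store, which is the performance argument above.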

Once again, I really appreciate your questions. If you have any more
questions, then I would like to hear them.

Best Regards.
DOYUNG YOON



On Wed, Jun 29, 2016 at 11:21 AM Ivan, Chen Penghe <pe...@adsc.com.sg>
wrote:

> Dear Mr. Do Yung Yoon,
>
> First of all, thank you very much for the clarification and it is very
> helpful.
>
> Regarding our usage, we actually have some Twitter user profile data as
> well as their timeline tweets, so we want to build a graph of these users
> and tweets. Graph links can occur between users, between tweets, or
> between users and tweets. In other words, we somewhat rebuild the Twitter
> network in a small subset.
> Our problem is that we cannot put all the data on one single server, so
> we want to find a distributed graph database so that data can be split across
> different machines automatically. (In fact, random partitioning should be
> good enough for our case and we do not need any customized partitioning
> strategies. ID should not be a problem, as both users and tweets have their
> own unique IDs.)
>
> In fact, I also considered OrientDB, but OrientDB needs the developer to
> split data explicitly. Then I reconsidered Titan, which I know is
> designed as a distributed graph database. Titan seems very attractive, but
> its future development is very uncertain. Then I luckily found this
> S2Graph project, which is also designed as a distributed graph database, and
> think it may be a better choice. However, there is not very much
> information about it on the Internet, so I want to know how S2Graph
> differs from Titan in terms of the underlying design and mechanisms.
> In addition, from the Git Book, it seems S2Graph is mainly targeted at
> vertex-based queries, like when we want to know the neighbors (or neighbors
> of neighbors) of a given vertex. The three members -- the Index Edge, the
> Snapshot Edge, and the Vertex -- are absolutely helpful in handling such
> kinds of queries, but how about other types of queries, like finding
> all edges with a specific type? Another question is that it seems there
> are S2Graph servers as well as HBase servers. So, may I say HBase servers
> store the actual data, while S2Graph is mainly for indexing? If so, how
> many S2Graph servers do we need: only one, or can we have more S2Graph
> servers? If we have more S2Graph servers, can the index partitioning be
> done automatically?
>
> Furthermore, our graph size is in fact not too large to be handled by a
> single machine, but the entire dataset including the vertex property values
> (like the tweet text for a tweet node) means our data cannot fit into a
> single machine. One possible choice is to put all of those property values
> into a separate storage like MongoDB, and use another graph database like
> Neo4j to build the graph topology. However, this method may not work if we
> want to search/traverse the graph based on certain property values. Hence,
> may I know your suggestions/comments on handling such a situation?
>
> Thank you very, very much for the help!
>
>
> Regards,
> Ivan
>
> On Tue, 2016-06-28 at 09:04 +0800, DO YUNG YOON wrote:
>
> Hi Ivan!
>
> Thanks for asking this question so we can have a discussion on this topic.
>
> I guess you are asking about graph partitioning like
> http://s3.thinkaurelius.com/docs/titan/1.0.0/graph-partitioning.html
> (Titan).
>
> In short, S2Graph only provides random partitioning currently, but I think
> providing an option for a custom partitioner (in our case, how to create the
> murmur hash value of the source vertex, since we prepend murmur hash bytes
> in front of the rowKey) would be pretty straightforward.
>
> Details follow.
>
> S2Graph enforces a vertex-centric index when the user defines a label, which
> is the schema for a relationship. Vertex-centric indexes address query
> performance for large-degree vertices. I believe this is the same as what
> Titan provides.
>
> However, write hotspots can be problematic with very popular vertices.
>
> One thing to note here is that S2Graph expects users to manage their vertex
> ids on a different system.
> Usually users want to keep their old existing system (say, an RDBMS for the
> user table), but want to take advantage of S2Graph for large data (usually
> relationships). S2Graph supports custom string ids (the user can provide any
> string as a vertex's id), so write hotspots can be avoided as follows, even
> though it is the user's responsibility to provide the right partition key.
>
> For example, justin_bieber <- followed_by <- [1, 3, 19894, 384, ...], and
> say the size of the adjacency list coming into justin_bieber is a million.
>
> We can partition user_ids into 10 buckets and change each user_id into a
> composite id: 1_0, 3_0, 19894_1989, and so on.
>
> I think the reason we have not discussed this yet is that S2Graph currently
> focuses on OLTP use cases, and in OLTP, I think explicit partitioning can be
> problematic (this is based only on my experience, so there is a high
> possibility I am wrong).
>
> A partition of data that is usually accessed together can become a query
> hotspot. To provide high query throughput, S2Graph wants to spread all
> requests as evenly as possible.
>
> I am very interested in your use case, so we can discuss this topic further.
> Can you provide a little more detail on your data and access patterns?
>
> On Mon, Jun 27, 2016 at 3:59 PM Ivan, Chen Penghe <pe...@adsc.com.sg>
> wrote:
>
> Dear Madam/Sir,
>
> This is Ivan, from Singapore.
>
> I am quite interested in using S2Graph to store graph data in our project,
> so I want to check with you, from the distributed perspective, how the
> mechanisms of S2Graph differ from Titan's. From the FAQ page, I can see
> the difference between S2Graph and Titan in terms of performance. However,
> I did not find information regarding how S2Graph handles graph partitioning
> to realize this distributed management.
>
> Many thanks!
>
>
> Regards,
> Ivan
>
>

Re: Inquiry -- S2Graph vs Titan

Posted by DO YUNG YOON <sh...@gmail.com>.
Hi Ivan!

Thanks for asking this question so we can have a discussion on this topic.

I guess you are asking about graph partitioning like
http://s3.thinkaurelius.com/docs/titan/1.0.0/graph-partitioning.html (Titan).

In short, S2Graph only provides random partitioning currently, but I think
providing an option for a custom partitioner (in our case, how to create the
murmur hash value of the source vertex, since we prepend murmur hash bytes in
front of the rowKey) would be pretty straightforward.

Details follow.

S2Graph enforces a vertex-centric index when the user defines a label, which
is the schema for a relationship. Vertex-centric indexes address query
performance for large-degree vertices. I believe this is the same as what
Titan provides.
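The payoff of a vertex-centric index can be sketched with a toy model (names are illustrative, not S2Graph internals): keep each vertex's outgoing edges sorted by an index property, so "latest N edges of v" is a short prefix read instead of a scan over the whole adjacency list.

```python
# Toy vertex-centric index: edges per vertex kept sorted by timestamp,
# newest first, so a top-N query touches only the first N entries.
import bisect

class VertexCentricIndex:
    def __init__(self):
        self._edges = {}  # vertex -> list of (sort_key, target), kept sorted

    def insert(self, src, ts, tgt):
        # negate the timestamp so the newest edge sorts first
        bisect.insort(self._edges.setdefault(src, []), (-ts, tgt))

    def latest(self, src, n):
        return [tgt for _, tgt in self._edges.get(src, [])[:n]]

idx = VertexCentricIndex()
for ts, tweet in [(100, "t1"), (300, "t3"), (200, "t2")]:
    idx.insert("user_id_1", ts, tweet)
print(idx.latest("user_id_1", 2))  # ['t3', 't2']
```

For a million-degree vertex, this is the difference between reading two rows and reading a million.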

However, write hotspots can be problematic with very popular vertices.

One thing to note here is that S2Graph expects users to manage their vertex
ids on a different system.
Usually users want to keep their old existing system (say, an RDBMS for the
user table), but want to take advantage of S2Graph for large data (usually
relationships). S2Graph supports custom string ids (the user can provide any
string as a vertex's id), so write hotspots can be avoided as follows, even
though it is the user's responsibility to provide the right partition key.

For example, justin_bieber <- followed_by <- [1, 3, 19894, 384, ...], and
say the size of the adjacency list coming into justin_bieber is a million.

We can partition user_ids into 10 buckets and change each user_id into a
composite id: 1_0, 3_0, 19894_1989, and so on.
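A minimal sketch of that composite-id idea (an assumption-laden illustration, not S2Graph code): spread a very high-degree vertex across N bucketed ids so its incoming edge writes land on different partitions; the bucket choice here, a hash of the follower's id, is one possible convention.

```python
# Sketch: bucket a hot vertex into NUM_BUCKETS composite ids so that
# concurrent writes to it are spread over several HBase regions.
import zlib

NUM_BUCKETS = 10

def bucketed_id(hot_vertex_id: str, follower_id: str) -> str:
    bucket = zlib.crc32(follower_id.encode()) % NUM_BUCKETS
    return f"{hot_vertex_id}_{bucket}"

# Each follower writes to one of 10 composite target ids:
targets = {bucketed_id("justin_bieber", str(f)) for f in [1, 3, 19894, 384]}

# Reading "who follows justin_bieber" then means querying all 10 buckets
# and merging the results:
all_buckets = [f"justin_bieber_{b}" for b in range(NUM_BUCKETS)]
```

The trade-off is the usual one: writes scale out, but every read of the hot vertex fans out into NUM_BUCKETS queries.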

I think the reason we have not discussed this yet is that S2Graph currently
focuses on OLTP use cases, and in OLTP, I think explicit partitioning can be
problematic (this is based only on my experience, so there is a high
possibility I am wrong).

A partition of data that is usually accessed together can become a query
hotspot. To provide high query throughput, S2Graph wants to spread all
requests as evenly as possible.

I am very interested in your use case, so we can discuss this topic further.
Can you provide a little more detail on your data and access patterns?

On Mon, Jun 27, 2016 at 3:59 PM Ivan, Chen Penghe <pe...@adsc.com.sg>
wrote:

> Dear Madam/Sir,
>
> This is Ivan, from Singapore.
>
> I am quite interested in using S2Graph to store graph data in our project,
> so I want to check with you, from the distributed perspective, how the
> mechanisms of S2Graph differ from Titan's. From the FAQ page, I can see
> the difference between S2Graph and Titan in terms of performance. However,
> I did not find information regarding how S2Graph handles graph partitioning
> to realize this distributed management.
>
> Many thanks!
>
>
> Regards,
> Ivan