Posted to dev@s2graph.apache.org by DO YUNG YOON <sh...@gmail.com> on 2018/12/31 23:57:10 UTC

[DISCUSS]: implementation of meta storage(schema)

Hi folks.

I want to discuss the current implementation of meta storage(schema).

The roles of the schema in S2Graph are as follows.

1. When S2Graph accepts a write request expressed as a logical
vertex/edge, it uses the schema to build the physical internal
representation, which is specific to the storage backend. The schema is
also used to validate the request.

2. When a query comes in, it uses the schema to build the physical request,
which is specific to the storage backend, and then uses the schema again to
transform the physical result back into logical vertices/edges.

The current implementation assumes that the schema is very small compared
to the actual vertex/edge data. Because the schema is read very often, it is
crucial to build an index that supports O(1) schema lookups, so the
implementation uses a local cache to increase performance.
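
To illustrate the idea of a local cache in front of the meta database, here
is a minimal sketch. The names LabelSchema and CachedLabelLookup are
hypothetical and are not taken from the S2Graph code base; the load function
stands in for whatever query the meta database actually needs.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical schema record; the real S2Graph schema classes carry more fields.
final case class LabelSchema(name: String, id: Int)

// Wraps a slow lookup (for example a meta database query) with an in-memory index.
final class CachedLabelLookup(load: String => Option[LabelSchema]) {
  // Both hits and misses are cached, so each name reaches `load` only once
  // in the single-threaded case.
  private val cache = TrieMap.empty[String, Option[LabelSchema]]

  // After the first access for a name, this is an in-memory map lookup.
  def findByName(name: String): Option[LabelSchema] =
    cache.getOrElseUpdate(name, load(name))
}
```

A real cache would also need some expiry or invalidation policy so that
schema changes become visible; that is left out here.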

The problem with the current implementation is that it is impossible to
inject a different implementation of the schema, since the implementations
are too tightly coupled.

In s2jobs, S2GraphSource/S2GraphSink use an S2Graph instance to
serialize/deserialize data from HFile, and there is no way to avoid
accessing the meta database for the schema on each Spark executor (details at
https://issues.apache.org/jira/browse/S2GRAPH-252). In this case, a static
schema could be built on the Spark driver by reading a file, reading the meta
database, or by any other means, and then broadcast to every Spark executor.
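
As a rough sketch of what such a static schema could look like: an immutable,
serializable snapshot that needs no database connection once it is built.
StaticSchema and ColumnSchema are hypothetical names for illustration only.
The round trip below uses plain Java serialization to stand in for shipping
the value to an executor; on Spark the driver would hand the snapshot to
SparkContext.broadcast and executors would read it through the broadcast's
value.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical, simplified schema snapshot; case classes are Serializable.
final case class ColumnSchema(name: String, dataType: String)

final case class StaticSchema(labels: Map[String, Seq[ColumnSchema]]) {
  // Pure in-memory lookup, no meta database access.
  def columnsOf(label: String): Seq[ColumnSchema] = labels.getOrElse(label, Nil)
}

object StaticSchemaDemo {
  // Serializes and deserializes the snapshot, as would happen when it is
  // sent from the driver to an executor.
  def roundTrip(schema: StaticSchema): StaticSchema = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(schema)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[StaticSchema] finally in.close()
  }
}
```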

In general, I believe what we need is a way to inject a different
implementation of the schema. Currently S2Graph has only one implementation,
which uses the meta database with a local cache, but it would be great if
the schema implementation were abstracted so that a different
implementation can be injected when we create an S2Graph instance.

To achieve this, I believe abstracting the necessary methods in one
interface is a good start, so here I collected most of the methods that are
related to the schema.

I suggest adding a SchemaManager interface and then refactoring the current
code base to access the schema through this interface.
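
To make the injection idea concrete, here is a heavily reduced sketch. Only
the name SchemaManager comes from the proposal; the findLabel method, the
Label class, InMemorySchemaManager, and the Graph stand-in are assumptions
made up for this example, and the actual draft collects many more methods.

```scala
// Hypothetical, simplified schema record.
final case class Label(name: String, id: Int)

// Reduced interface: one lookup method stands in for the full set.
trait SchemaManager {
  def findLabel(name: String): Option[Label]
}

// A fixed in-memory schema, e.g. for Spark executors or unit tests.
final class InMemorySchemaManager(labels: Seq[Label]) extends SchemaManager {
  private val byName: Map[String, Label] = labels.map(l => l.name -> l).toMap
  def findLabel(name: String): Option[Label] = byName.get(name)
}

// Stand-in for the graph: it depends only on the trait, so any
// implementation can be passed in at construction time.
final class Graph(schema: SchemaManager) {
  def validateEdge(label: String): Either[String, Label] =
    schema.findLabel(label).toRight(s"unknown label: $label")
}
```

With this shape, an implementation backed by the meta database and local
cache would be just another class extending the trait, and callers would not
need to change.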

I want to discuss whether this is the right way and whether we need to work
on this first, since it will affect a lot of code.

Please feel free to comment.

Here is the draft of the interface.

https://docs.google.com/document/d/134zPVm8vtXMRKC77bsVorp_06zZhU9hk2rQIKXH6HpI/edit?usp=sharing