Posted to dev@s2graph.apache.org by "DOYUNG YOON (JIRA)" <ji...@apache.org> on 2018/03/15 13:25:00 UTC

[jira] [Commented] (S2GRAPH-183) Provide batch job to dump data stored in HBase into file.

    [ https://issues.apache.org/jira/browse/S2GRAPH-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16400407#comment-16400407 ] 

DOYUNG YOON commented on S2GRAPH-183:
-------------------------------------

After experimenting a little, I found that it would be useful to abstract the bulk writer and bulk reader as follows.
h2. Bulk Writer

The current implementation expects a tsv text file, translates it into HBase's KeyValue, then runs LoadIncrementalHFiles.

It roughly follows these steps.

1. tsv (user-provided type) to GraphElement
2. GraphElement to SKeyValue
3. SKeyValue to HBase's KeyValue
4. Finally, build HFile from KeyValue and run LoadIncrementalHFiles.
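
The four steps above could be sketched roughly as follows. This is only an illustration under assumptions: Edge, SKv and HKv are hypothetical stand-ins for GraphElement, SKeyValue and HBase's KeyValue, the tsv column order (timestamp, source, label, target) is assumed, and the final sort merely stands in for HFile generation, which needs cells in key order.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import scala.util.Try

object BulkWriterSketch {
  // Hypothetical stand-ins, not the real S2Graph/HBase classes.
  final case class Edge(src: String, label: String, tgt: String, ts: Long)
  final case class SKv(row: String, qualifier: String, ts: Long)
  final case class HKv(row: Vector[Byte], qualifier: Vector[Byte], ts: Long)

  // Step 1: one tsv line to a graph element; malformed lines yield None.
  def parse(line: String): Option[Edge] = line.split("\t") match {
    case Array(ts, src, label, tgt) =>
      Try(ts.toLong).toOption.map(t => Edge(src, label, tgt, t))
    case _ => None
  }

  // Step 2: graph element to storage-neutral key/values (one per edge here).
  def toSKv(e: Edge): Seq[SKv] = Seq(SKv(e.src, e.label + ":" + e.tgt, e.ts))

  // Step 3: storage-neutral key/value to the storage-specific cell.
  def toHKv(kv: SKv): HKv =
    HKv(kv.row.getBytes(UTF_8).toVector, kv.qualifier.getBytes(UTF_8).toVector, kv.ts)

  // Step 4 (stand-in): order the cells by key, as HFile writing would require.
  def run(lines: Seq[String]): Seq[HKv] =
    lines.flatMap(parse(_)).flatMap(toSKv).sortBy(kv => (kv.row, kv.qualifier)).map(toHKv)
}
```

Each step is an ordinary function, so swapping the input format or the storage target only touches step 1 or step 3.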

Similarly, bulk read can be achieved as follows.
h2. Bulk Reader

1. HBase's KeyValue to SKeyValue.
2. SKeyValue to GraphElement.
3. GraphElement to any type (user-wanted type).
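
The reader direction could be sketched the same way. Again the types are hypothetical stand-ins (HKv for HBase's KeyValue, SKv for SKeyValue, Edge for GraphElement), and the "label:target" qualifier layout is an assumption made only for this example.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object BulkReaderSketch {
  // Hypothetical stand-ins, not the real S2Graph/HBase classes.
  final case class HKv(row: Vector[Byte], qualifier: Vector[Byte], ts: Long)
  final case class SKv(row: String, qualifier: String, ts: Long)
  final case class Edge(src: String, label: String, tgt: String, ts: Long)

  // Step 1: storage-specific cell to storage-neutral key/value.
  def toSKv(kv: HKv): SKv =
    SKv(new String(kv.row.toArray, UTF_8), new String(kv.qualifier.toArray, UTF_8), kv.ts)

  // Step 2: key/value to graph element; cells that do not decode are dropped.
  def toEdge(kv: SKv): Option[Edge] = kv.qualifier.split(":", 2) match {
    case Array(label, tgt) => Some(Edge(kv.row, label, tgt, kv.ts))
    case _                 => None
  }

  // Step 3: graph element to whatever type the user asks for.
  def dump[T](cells: Seq[HKv])(format: Edge => T): Seq[T] =
    cells.map(toSKv).flatMap(toEdge(_)).map(format)

  // One possible user-supplied format: the tsv line the loader would accept.
  val toTsv: Edge => String = e => s"${e.ts}\t${e.src}\t${e.label}\t${e.tgt}"
}
```

Passing a different `format` function (for example one that emits JSON) changes the dump output without touching the first two steps.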

To abstract these transformations and let users inject their own implementations, I suggest the following interfaces.
h2. Interface
{noformat}
trait GraphElementReadable[S] extends Serializable {
  def read(graph: S2Graph)(data: S): Option[GraphElement]
}
{noformat}
{noformat}
trait SKeyValueWritable[T] extends Serializable {
  def write(kv: SKeyValue): T
}
{noformat}
It would also be useful to provide implicit implementations of the above two type classes.
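
As a rough sketch of what implicit instances might look like: the traits below are simplified, hypothetical versions of the two interfaces (the S2Graph parameter is dropped, and Element and SKv stand in for GraphElement and SKeyValue). Default instances live in the companion objects, so the compiler finds them without extra imports.

```scala
object TypeClassSketch {
  // Hypothetical stand-ins for GraphElement and SKeyValue.
  final case class Element(id: String)
  final case class SKv(row: String, value: String)

  trait ElementReadable[S] extends Serializable { def read(data: S): Option[Element] }
  trait SKvWritable[T] extends Serializable { def write(kv: SKv): T }

  object ElementReadable {
    // Default reader: a non-blank string becomes an element id.
    implicit val fromString: ElementReadable[String] = new ElementReadable[String] {
      def read(data: String): Option[Element] =
        Some(data.trim).filter(_.nonEmpty).map(Element(_))
    }
  }

  object SKvWritable {
    // Default writer: a plain (row, value) pair.
    implicit val toPair: SKvWritable[(String, String)] = new SKvWritable[(String, String)] {
      def write(kv: SKv): (String, String) = (kv.row, kv.value)
    }
  }

  // The job is written once; reader and writer are resolved from the types S and T.
  def transform[S, T](input: Seq[S])(implicit reader: ElementReadable[S],
                                     writer: SKvWritable[T]): Seq[T] =
    input.flatMap(reader.read(_)).map(e => SKv(e.id, "")).map(writer.write)
}
```

A caller would then write `transform[String, (String, String)](lines)` and override either instance by bringing a different implicit into scope.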

One example is the current bulk loader, which reads a single line of tsv and translates it into a GraphElement.
{noformat}
val tsv2GraphElement = new GraphElementReadable[String] {
  override def read(graph: S2Graph)(data: String): Option[GraphElement] = {
    graph.elementBuilder.toGraphElement(data)
  }
}
{noformat}
Note that other input formats, such as RDF statements, can be translated to GraphElement by implementing the GraphElementReadable type class; bulk loading RDF statements into S2Graph then needs no other code.
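
To illustrate the RDF case, a minimal sketch might look like the following. Statement is a hypothetical triple class, not a class from any real RDF library, Edge stands in for GraphElement, and the reader trait is a simplified version without the S2Graph parameter.

```scala
object RdfReaderSketch {
  // Hypothetical triple type; a real job would use its RDF library's statement class.
  final case class Statement(subject: String, predicate: String, obj: String)
  // Hypothetical stand-in for GraphElement.
  final case class Edge(src: String, label: String, tgt: String)

  trait ElementReadable[S] extends Serializable { def read(data: S): Option[Edge] }

  // subject -> source vertex, predicate -> label, object -> target vertex.
  val rdfReader: ElementReadable[Statement] = new ElementReadable[Statement] {
    def read(st: Statement): Option[Edge] =
      if (st.subject.isEmpty || st.obj.isEmpty) None
      else Some(Edge(st.subject, st.predicate, st.obj))
  }

  // The loader only knows the type class, so it is reused unchanged for any input type.
  def load[S](input: Seq[S], reader: ElementReadable[S]): Seq[Edge] =
    input.flatMap(reader.read(_))
}
```

Only `rdfReader` is specific to RDF here; `load` is the same code a tsv reader would go through.
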
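The writer side follows the same pattern. For HBase, the writer converts an SKeyValue into HBase's KeyValue:
{noformat}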
val writerToHBaseKeyValue = new SKeyValueWritable[HKeyValue] {
  override def write(kv: SKeyValue): HKeyValue = {
    new HKeyValue(kv.row, kv.cf, kv.qualifier, kv.timestamp, kv.value)
  }
}
{noformat}
Note that other storage types require a different physical class (e.g., RocksDB uses a different class, WriteBatch).
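
A sketch of how two storage targets might share the same writer interface is shown below. Everything here is hypothetical: HCell stands in for HBase's KeyValue, PutOp for a single put that a RocksDB-style batch would collect, and the row ++ qualifier key layout is invented purely for illustration.

```scala
object StorageWriterSketch {
  // Hypothetical stand-in for SKeyValue.
  final case class SKv(row: Vector[Byte], cf: Vector[Byte], qualifier: Vector[Byte],
                       timestamp: Long, value: Vector[Byte])
  // Hypothetical stand-in for an HBase cell.
  final case class HCell(row: Vector[Byte], family: Vector[Byte], qualifier: Vector[Byte],
                         timestamp: Long, value: Vector[Byte])
  // Hypothetical stand-in for one key/value put in a RocksDB-style batch.
  final case class PutOp(key: Vector[Byte], value: Vector[Byte])

  trait SKvWritable[T] extends Serializable { def write(kv: SKv): T }

  // Column-oriented target: every field maps across one to one.
  val toHCell: SKvWritable[HCell] = new SKvWritable[HCell] {
    def write(kv: SKv): HCell = HCell(kv.row, kv.cf, kv.qualifier, kv.timestamp, kv.value)
  }

  // Flat key/value target: the key layout below is an assumption for this example only.
  val toPutOp: SKvWritable[PutOp] = new SKvWritable[PutOp] {
    def write(kv: SKv): PutOp = PutOp(kv.row ++ kv.qualifier, kv.value)
  }

  // The job is parameterised on the writer, so the storage choice stays at the call site.
  def writeAll[T](kvs: Seq[SKv], writer: SKvWritable[T]): Seq[T] = kvs.map(writer.write)
}
```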

> Provide batch job to dump data stored in HBase into file.
> ---------------------------------------------------------
>
>                 Key: S2GRAPH-183
>                 URL: https://issues.apache.org/jira/browse/S2GRAPH-183
>             Project: S2Graph
>          Issue Type: New Feature
>            Reporter: DOYUNG YOON
>            Assignee: DOYUNG YOON
>            Priority: Major
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Since s2graph provides a batch job to read a file and bulk load it into HBase, it would also be helpful to provide a batch job to dump data stored in HBase into a file.
> I think once we have the dump (deserializer) and the loader (serializer), adding an index on existing data or changing the HBase schema version can be achieved by this offline process.
> Also, data migration from an external HBase cluster into the s2graph HBase cluster becomes possible, which can be useful.


