Posted to reviews@spark.apache.org by haosdent <gi...@git.apache.org> on 2014/04/06 10:03:59 UTC

[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Github user haosdent commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-39661707
  
    @marmbrus @pwendell I restarted work on this issue today. After reading the sources related to `SchemaRDD`, I think the better approach is to provide both `saveAsHBaseTable(rdd: RDD[Text], ...)` and `saveAsHBaseTable(rdd: SchemaRDD, ...)`. HBase is quite different from an RDBMS. `SchemaRDD` assumes every cell in a `Row` has a `name` and a `dataType`. That assumption is fine for Hive or Parquet, but for HBase it loses some important parts. In HBase, all data is stored as `Array[Byte]` and has no `dataType`. Moreover, every cell in HBase has a rowkey (like an index in an RDBMS), a qualifier (like the `name` above), and a column family. The column family cannot be represented in a `SchemaRDD`.
    
    So for users with specific requirements on column families, we could provide `saveAsHBaseTable(rdd: RDD[Text], ...)` and document how to use it; that gives users maximum flexibility with HBase. On the other hand, `saveAsHBaseTable(rdd: SchemaRDD, ...)` is also needed for users who have only one column family. We could set a fixed column family in the initialization of `SparkHBaseWriter` to work around the problem above.
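    To make the fixed-column-family work-around concrete, here is a minimal sketch (hypothetical names, no actual HBase or Spark dependency) of how a writer initialized with one family could turn a schema-style row into HBase-style cells. Everything HBase sees is `Array[Byte]` keyed by (rowkey, family, qualifier), so the `dataType` is lost and the family must come from outside the row:
    
    ```scala
    // Sketch only: `Cell` and `SparkHBaseWriterSketch` are illustrative,
    // not part of the proposed PR. A real writer would emit HBase `Put`s.
    case class Cell(rowKey: Array[Byte], family: Array[Byte],
                    qualifier: Array[Byte], value: Array[Byte])
    
    class SparkHBaseWriterSketch(fixedFamily: String) {
      // The column family is fixed at initialization because a
      // SchemaRDD row carries only (name, value), never the family.
      private val family = fixedFamily.getBytes("UTF-8")
    
      // A "row" here is just column-name -> value; every piece is
      // serialized to bytes, which is all HBase stores.
      def toCells(rowKey: String, row: Map[String, String]): Seq[Cell] =
        row.toSeq.map { case (name, v) =>
          Cell(rowKey.getBytes("UTF-8"), family,
               name.getBytes("UTF-8"), v.getBytes("UTF-8"))
        }
    }
    ```
    
    Usage would look like `new SparkHBaseWriterSketch("cf1").toCells("row1", Map("a" -> "1"))`, producing one cell per column, all under the family `cf1`. The `RDD[Text]` variant would instead let the caller encode the family per cell.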


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---