Posted to reviews@spark.apache.org by haosdent <gi...@git.apache.org> on 2014/03/21 10:11:40 UTC

[GitHub] spark pull request: Add spark-hbase.

GitHub user haosdent opened a pull request:

    https://github.com/apache/spark/pull/194

    Add spark-hbase.

    RT

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/haosdent/spark SPARK-1127

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/194.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #194
    
----
commit 4ba30242dad552acd574777a17115435ddeea795
Author: haosdent <ha...@gmail.com>
Date:   2014-03-21T09:09:09Z

    Add spark-hbase.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes to enable it, or if the feature is enabled but not working,
please contact infrastructure at infrastructure@apache.org or file a JIRA
ticket with INFRA.
---

[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/194#discussion_r10862643
  
    --- Diff: external/hbase/src/main/scala/org/apache/spark/nosql/hbase/HBaseUtils.scala ---
    @@ -0,0 +1,60 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.nosql.hbase
    +
    +import org.apache.hadoop.io.Text
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * A public object that provide HBase supports.
    + * You could save RDD into HBase through [[org.apache.spark.nosql.hbase.HBaseUtils.saveAsHBaseTable]] method.
    + */
    +object HBaseUtils {
    +
    +  /**
    +   * Save [[org.apache.spark.rdd.RDD[Text]]] as a HBase table
    +   * @param rdd [[org.apache.spark.rdd.RDD[Text]]]
    +   * @param zkHost the zookeeper hosts. e.g. "10.232.98.10,10.232.98.11,10.232.98.12"
    +   * @param zkPort the zookeeper client listening port. e.g. "2181"
    +   * @param zkNode the zookeeper znode of HBase. e.g. "hbase-apache"
    +   * @param table the name of table which we save records
    +   * @param rowkeyType the type of rowkey. [[org.apache.spark.nosql.hbase.HBaseType]]
    +   * @param columns the column list. [[org.apache.spark.nosql.hbase.HBaseColumn]]
    +   * @param delimiter the delimiter which used to split record into fields
    +   */
    +  def saveAsHBaseTable(rdd: RDD[Text], zkHost: String, zkPort: String, zkNode: String,
    +                       table: String, rowkeyType: String, columns: List[HBaseColumn], delimiter: Char) {
    +    val conf = new HBaseConf(zkHost, zkPort, zkNode, table, rowkeyType, columns, delimiter)
    +
    +    def writeToHBase(iter: Iterator[Text]) {
    +      val writer = new SparkHBaseWriter(conf)
    +
    +      try {
    +        writer.init()
    +        while (iter.hasNext) {
    +          val record = iter.next()
    +          writer.write(record)
    +        }
    +      } finally {
    +        writer.close()
    +      }
    --- End diff --
    
    No, what you have currently is much better than this.
    There are two issues here:
    
    1) If init() throws an exception and we get to the finally block, how will writer.close() behave?
    a) Will it throw an exception?
    b) Will it ignore the failure? (I would hope this.)
    c) Is it undefined? In that case we can't rely on this.
    
    2) While executing the loop where we call writer.write, it is possible for an exception to be thrown.
    That gets us to the finally block, where we try to close the writer -> which can again throw an IOException (from what I recall of the writer's javadoc).
    What the user will then see in the logs is the second exception - and the reason for it, the original writer.write exception (or some other exception from the try block), will be lost.
    
    To handle (2), within the finally block, simply wrap writer.close() in a try/catch, log the message, and ignore it there.
    To handle (1), we need to know what the expectation from init() is.
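
    A minimal sketch of the fix for (2), reusing the writeToHBase closure from the diff above (logWarning assumes the writer class mixes in Spark's Logging trait, as a later revision of SparkHBaseWriter in this thread does):

        def writeToHBase(iter: Iterator[Text]) {
          val writer = new SparkHBaseWriter(conf)
          try {
            writer.init()
            while (iter.hasNext) {
              writer.write(iter.next())
            }
          } finally {
            // Swallow close() failures so they cannot mask an exception
            // already thrown from init() or write() in the try block.
            try {
              writer.close()
            } catch {
              case e: Exception => logWarning("Failed to close HBase writer", e)
            }
          }
        }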



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-65164456
  
    Thanks for working on this! I think this will be a really useful addition. However, with the new external data sources API that is part of 1.2, I think it might be better to do this as an external library (for example: https://github.com/databricks/spark-avro). This would make it easier to make releases, and it would also help us keep Spark core's size manageable. If you agree, maybe we can close this issue? Let me know if you have any questions.





[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-38357324
  
    @mridulm @tedyu I updated the pull request per your advice. Could you review it again? Thank you very much.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by tedyu <gi...@git.apache.org>.
Github user tedyu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/194#discussion_r10863208
  
    --- Diff: external/hbase/src/main/scala/org/apache/spark/nosql/hbase/HBaseUtils.scala ---
    @@ -0,0 +1,68 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.nosql.hbase
    +
    +import org.apache.hadoop.io.Text
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * A public object that provide HBase supports.
    + * You could save RDD into HBase through [[org.apache.spark.nosql.hbase.HBaseUtils.saveAsHBaseTable]] method.
    + */
    +object HBaseUtils {
    +
    +  /**
    +   * Save [[org.apache.spark.rdd.RDD[Text]]] as a HBase table
    +   *
    +   * The format of record in RDD should looks like this:
    +   *   rowkey|delimiter|column|delimiter|column|delimiter|...
    +   * For example (if delimiter is ","):
    +   *   0001,apple,banana
    +   * "0001" is rowkey field while "apple" and "banana" are column fields.
    +   *
    +   * @param rdd [[org.apache.spark.rdd.RDD[Text]]]
    +   * @param zkHost the zookeeper hosts. e.g. "10.232.98.10,10.232.98.11,10.232.98.12"
    +   * @param zkPort the zookeeper client listening port. e.g. "2181"
    +   * @param zkNode the zookeeper znode of HBase. e.g. "hbase-apache"
    +   * @param table the name of table which we save records
    +   * @param rowkeyType the type of rowkey. [[org.apache.spark.nosql.hbase.HBaseType]]
    +   * @param columns the column list. [[org.apache.spark.nosql.hbase.HBaseColumn]]
    +   * @param delimiter the delimiter which used to split record into fields
    +   */
    +  def saveAsHBaseTable(rdd: RDD[Text], zkHost: String, zkPort: String, zkNode: String,
    +                       table: String, rowkeyType: String, columns: List[HBaseColumn], delimiter: Char) {
    +    val conf = new HBaseConf(zkHost, zkPort, zkNode, table, rowkeyType, columns, delimiter)
    +
    +    def writeToHBase(iter: Iterator[Text]) {
    +      val writer = new SparkHBaseWriter(conf)
    +
    +      try {
    +        writer.init()
    +
    +        while (iter.hasNext) {
    +          val record = iter.next()
    +          writer.write(record)
    +        }
    +      } finally {
    +        writer.close()
    --- End diff --
    
    I thought try/catch blocks would be added here.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-38523893
  
    >Let me know if you have any questions about how you could use this functionality.
    
    Thank you very much! :-)



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/194#discussion_r10862584
  
    --- Diff: external/hbase/src/main/scala/org/apache/spark/nosql/hbase/SparkHBaseWriter.scala ---
    @@ -0,0 +1,139 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.nosql.hbase
    +
    +import org.apache.hadoop.hbase.client.{Put, HTable}
    +import org.apache.hadoop.io.Text
    +import org.apache.hadoop.hbase.util.Bytes
    +import org.apache.commons.codec.binary.Hex
    +import org.apache.hadoop.hbase.HConstants
    +import org.apache.hadoop.conf.Configuration
    +import java.io.IOException
    +
    +/**
    + * Internal helper class that saves an RDD using a HBase OutputFormat. This is only public
    + * because we need to access this class from the `spark` package to use some package-private HBase
    + * functions, but this class should not be used directly by users.
    + */
    +private[apache]
    --- End diff --
    
    If this is internal to the hbase package, restrict visibility to hbase rather than apache.
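
    For reference, a sketch of the suggested narrowing (a later revision of the diff in this thread does adopt private[hbase]):

        // Before: visible to everything under org.apache
        private[apache] class SparkHBaseWriter(conf: HBaseConf) { /* ... */ }

        // After: visible only within org.apache.spark.nosql.hbase
        private[hbase] class SparkHBaseWriter(conf: HBaseConf) { /* ... */ }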



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on a diff in the pull request:

    https://github.com/apache/spark/pull/194#discussion_r10863237
  
    --- Diff: external/hbase/src/test/scala/org/apache/spark/nosql/hbase/HBaseSuite.scala ---
    @@ -0,0 +1,48 @@
    +package org.apache.spark.nosql.hbase
    +
    +import org.scalatest.FunSuite
    +import org.apache.hadoop.io.Text
    +import org.apache.spark.{SparkContext, LocalSparkContext}
    +import org.apache.hadoop.hbase.util.Bytes
    +import org.apache.hadoop.hbase.{HConstants, HBaseTestingUtility}
    +import org.apache.hadoop.hbase.client.{Scan, HTable}
    +
    +class HBaseSuite extends FunSuite with LocalSparkContext {
    +
    +  test("write SequenceFile using HBase") {
    +    sc = new SparkContext("local", "test")
    +    val nums = sc.makeRDD(1 to 3).map(x => new Text("a" + x + " 1.0"))
    +
    +    val table = "test"
    +    val rowkeyType = HBaseType.String
    +    val cfBytes = Bytes.toBytes("cf")
    +    val qualBytes = Bytes.toBytes("qual0")
    +    val columns = List[HBaseColumn](new HBaseColumn(cfBytes, qualBytes, HBaseType.Float))
    +    val delimiter = ' '
    +
    +    val util = new HBaseTestingUtility()
    +    util.startMiniCluster()
    +    util.createTable(Bytes.toBytes(table), cfBytes)
    +    val conf = util.getConfiguration
    +    val zkHost = conf.get(HConstants.ZOOKEEPER_QUORUM)
    +    val zkPort = conf.get(HConstants.ZOOKEEPER_CLIENT_PORT)
    +    val zkNode = conf.get(HConstants.ZOOKEEPER_ZNODE_PARENT)
    +
    +    HBaseUtils.saveAsHBaseTable(nums, zkHost, zkPort, zkNode, table, rowkeyType, columns, delimiter)
    +
    +    val htable = new HTable(conf, table)
    +    val scan = new Scan()
    +    val rs = htable.getScanner(scan)
    --- End diff --
    
    Sorry, I forgot it. I have fixed it.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-38329364
  
    Merged build finished.



[GitHub] spark pull request: Add spark-hbase.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-38258918
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-39718976
  
    Quite confused about `InputStreamsSuite`. It passes on my local machine, and it is a test case from the `streaming` module; my pull request does not touch any code related to that module.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-39719475
  
    The error comes from `https://travis-ci.org/apache/spark/builds/22424147`. I will trigger Travis again after that bug is fixed on master.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-38329366
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13327/



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by tedyu <gi...@git.apache.org>.
Github user tedyu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/194#discussion_r10862788
  
    --- Diff: external/hbase/src/main/scala/org/apache/spark/nosql/hbase/SparkHBaseWriter.scala ---
    @@ -0,0 +1,139 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.nosql.hbase
    +
    +import org.apache.hadoop.hbase.client.{Put, HTable}
    +import org.apache.hadoop.io.Text
    +import org.apache.hadoop.hbase.util.Bytes
    +import org.apache.commons.codec.binary.Hex
    +import org.apache.hadoop.hbase.HConstants
    +import org.apache.hadoop.conf.Configuration
    +import java.io.IOException
    +
    +/**
    + * Internal helper class that saves an RDD using a HBase OutputFormat. This is only public
    + * because we need to access this class from the `spark` package to use some package-private HBase
    + * functions, but this class should not be used directly by users.
    + */
    +private[apache]
    +class SparkHBaseWriter(conf: HBaseConf) {
    +
    +  private var htable: HTable = null
    +
    +  val zkHost = conf.zkHost
    +  val zkPort = conf.zkPort
    +  val zkNode = conf.zkNode
    +  val table = conf.table
    +  val rowkeyType = conf.rowkeyType
    +  val columns = conf.columns
    +  val delimiter = conf.delimiter
    +
    +  def init() {
    +    val conf = new Configuration()
    +    conf.set(HConstants.ZOOKEEPER_QUORUM, zkHost)
    +    conf.set(HConstants.ZOOKEEPER_CLIENT_PORT, zkPort)
    +    conf.set(HConstants.ZOOKEEPER_ZNODE_PARENT, zkNode)
    +    htable = new HTable(conf, table)
    +    // Use default writebuffersize to submit batch puts
    +    htable.setAutoFlush(false)
    +  }
    +
    +  /**
    +   * Convert field to bytes
    +   * @param field split by delimiter from record
    +   * @param kind the type of field
    +   * @return
    +   */
    +  def toByteArr(field: String, kind: String) = kind match {
    +    case HBaseType.Boolean => Bytes.toBytes(field.toBoolean)
    +    case HBaseType.Short => Bytes.toBytes(field.toShort)
    +    case HBaseType.Int => Bytes.toBytes(field.toInt)
    +    case HBaseType.Long => Bytes.toBytes(field.toLong)
    +    case HBaseType.Float => Bytes.toBytes(field.toFloat)
    +    case HBaseType.Double => Bytes.toBytes(field.toDouble)
    +    case HBaseType.String => Bytes.toBytes(field)
    +    case HBaseType.Bytes => Hex.decodeHex(field.toCharArray)
    +    case _ => throw new IOException("Unsupported data type.")
    +  }
    +
    +  /**
    +   * Convert a string record to [[org.apache.hadoop.hbase.client.Put]]
    +   * @param record
    +   * @return
    +   */
    +  def parseRecord(record: String) = {
    +    val fields = record.split(delimiter)
    +    val put = new Put(toByteArr(fields(0), rowkeyType))
    +
    +    List.range(1, fields.size) foreach {
    +      i => put.add(columns(i - 1).family, columns(i - 1).qualifier,
    +        toByteArr(fields(i), columns(i - 1).typ))
    +    }
    +
    +    put
    +  }
    +
    +  def write(record: Text) {
    +    val put = parseRecord(record.toString)
    +    htable.put(put)
    +  }
    +
    +  def close() {
    +    htable.close()
    --- End diff --
    
    If there is a problem in init(), htable may be null.
    Add a null check here.
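
    A minimal sketch of that null check, assuming the htable field from the diff above:

        def close() {
          // init() may have failed before htable was assigned,
          // so guard against closing a null reference.
          if (htable != null) {
            htable.close()
          }
        }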



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-39821743
  
    @pwendell @marmbrus Could you help me review this pull request again? Thank you in advance.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-38259129
  
    ping @blackniuza



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-38324071
  
    Merged build started.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on a diff in the pull request:

    https://github.com/apache/spark/pull/194#discussion_r10862664
  
    --- Diff: external/hbase/src/main/scala/org/apache/spark/nosql/hbase/HBaseUtils.scala ---
    @@ -0,0 +1,60 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.nosql.hbase
    +
    +import org.apache.hadoop.io.Text
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * A public object that provide HBase supports.
    + * You could save RDD into HBase through [[org.apache.spark.nosql.hbase.HBaseUtils.saveAsHBaseTable]] method.
    + */
    +object HBaseUtils {
    +
    +  /**
    +   * Save [[org.apache.spark.rdd.RDD[Text]]] as a HBase table
    +   * @param rdd [[org.apache.spark.rdd.RDD[Text]]]
    +   * @param zkHost the zookeeper hosts. e.g. "10.232.98.10,10.232.98.11,10.232.98.12"
    +   * @param zkPort the zookeeper client listening port. e.g. "2181"
    +   * @param zkNode the zookeeper znode of HBase. e.g. "hbase-apache"
    +   * @param table the name of table which we save records
    +   * @param rowkeyType the type of rowkey. [[org.apache.spark.nosql.hbase.HBaseType]]
    +   * @param columns the column list. [[org.apache.spark.nosql.hbase.HBaseColumn]]
    +   * @param delimiter the delimiter which used to split record into fields
    +   */
    +  def saveAsHBaseTable(rdd: RDD[Text], zkHost: String, zkPort: String, zkNode: String,
    +                       table: String, rowkeyType: String, columns: List[HBaseColumn], delimiter: Char) {
    +    val conf = new HBaseConf(zkHost, zkPort, zkNode, table, rowkeyType, columns, delimiter)
    +
    +    def writeToHBase(iter: Iterator[Text]) {
    +      val writer = new SparkHBaseWriter(conf)
    +
    +      try {
    +        writer.init()
    +        while (iter.hasNext) {
    +          val record = iter.next()
    +          writer.write(record)
    +        }
    +      } finally {
    +        writer.close()
    +      }
    --- End diff --
    
    @mridulm Thank you for the clarification. Let me fix this.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/194#discussion_r10862647
  
    --- Diff: external/hbase/src/main/scala/org/apache/spark/nosql/hbase/SparkHBaseWriter.scala ---
    @@ -0,0 +1,139 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.nosql.hbase
    +
    +import org.apache.hadoop.hbase.client.{Put, HTable}
    +import org.apache.hadoop.io.Text
    +import org.apache.hadoop.hbase.util.Bytes
    +import org.apache.commons.codec.binary.Hex
    +import org.apache.hadoop.hbase.HConstants
    +import org.apache.hadoop.conf.Configuration
    +import java.io.IOException
    +
    +/**
    + * Internal helper class that saves an RDD using a HBase OutputFormat. This is only public
    + * because we need to access this class from the `spark` package to use some package-private HBase
    + * functions, but this class should not be used directly by users.
    + */
    +private[apache]
    +class SparkHBaseWriter(conf: HBaseConf) {
    +
    +  private var htable: HTable = null
    +
    +  val zkHost = conf.zkHost
    +  val zkPort = conf.zkPort
    +  val zkNode = conf.zkNode
    +  val table = conf.table
    +  val rowkeyType = conf.rowkeyType
    +  val columns = conf.columns
    +  val delimiter = conf.delimiter
    +
    +  def init() {
    +    val conf = new Configuration()
    +    conf.set(HConstants.ZOOKEEPER_QUORUM, zkHost)
    +    conf.set(HConstants.ZOOKEEPER_CLIENT_PORT, zkPort)
    +    conf.set(HConstants.ZOOKEEPER_ZNODE_PARENT, zkNode)
    +    htable = new HTable(conf, table)
    +    // Use default writebuffersize to submit batch puts
    +    htable.setAutoFlush(false)
    +  }
    +
    +  /**
    +   * Convert field to bytes
    +   * @param field split by delimiter from record
    +   * @param kind the type of field
    +   * @return
    +   */
    +  def toByteArr(field: String, kind: String) = kind match {
    +    case HBaseType.Boolean => Bytes.toBytes(field.toBoolean)
    +    case HBaseType.Short => Bytes.toBytes(field.toShort)
    +    case HBaseType.Int => Bytes.toBytes(field.toInt)
    +    case HBaseType.Long => Bytes.toBytes(field.toLong)
    +    case HBaseType.Float => Bytes.toBytes(field.toFloat)
    +    case HBaseType.Double => Bytes.toBytes(field.toDouble)
    +    case HBaseType.String => Bytes.toBytes(field)
    +    case HBaseType.Bytes => Hex.decodeHex(field.toCharArray)
    +    case _ => throw new IOException("Unsupported data type.")
    +  }
    +
    +  /**
    +   * Convert a string record to [[org.apache.hadoop.hbase.client.Put]]
    +   * @param record
    +   * @return
    +   */
    +  def parseRecord(record: String) = {
    +    val fields = record.split(delimiter)
    +    val put = new Put(toByteArr(fields(0), rowkeyType))
    +
    +    List.range(1, fields.size) foreach {
    +      i => put.add(columns(i - 1).family, columns(i - 1).qualifier,
    +        toByteArr(fields(i), columns(i - 1).typ))
    +    }
    +
    +    put
    --- End diff --
    
    I don't know how HBase is normally used by users, so I can't comment, unfortunately.
    It was not clear if this was the assumption, but I inferred it based on the "- 1" in the rest of the code, etc.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-38353931
  
    I made a few comments, but overall it looks good. Thanks for the effort!



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-65404868
  
    Thank you very much. :-)





[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by tedyu <gi...@git.apache.org>.
Github user tedyu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/194#discussion_r10863161
  
    --- Diff: external/hbase/src/main/scala/org/apache/spark/nosql/hbase/HBaseUtils.scala ---
    @@ -0,0 +1,68 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.nosql.hbase
    +
    +import org.apache.hadoop.io.Text
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * A public object that provide HBase supports.
    --- End diff --
    
    nit: should read:
    provides HBase support



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-38881863
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-39667634
  
    @marmbrus @pwendell Could you help me review this pull request again? I have already added the `saveAsHBaseTable(rdd: SchemaRDD ...)` method. Thank you in advance.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-39823246
  
    Hey thanks, we can take a look after the current release deadline. Thanks!



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/194#discussion_r10862604
  
    --- Diff: external/hbase/src/test/resources/log4j.properties ---
    @@ -0,0 +1,29 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +# Set everything to be logged to the file streaming/target/unit-tests.log
    --- End diff --
    
    Change the comment to reflect the actual file used.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on a diff in the pull request:

    https://github.com/apache/spark/pull/194#discussion_r10862676
  
    --- Diff: external/hbase/src/main/scala/org/apache/spark/nosql/hbase/SparkHBaseWriter.scala ---
    @@ -0,0 +1,139 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.nosql.hbase
    +
    +import org.apache.hadoop.hbase.client.{Put, HTable}
    +import org.apache.hadoop.io.Text
    +import org.apache.hadoop.hbase.util.Bytes
    +import org.apache.commons.codec.binary.Hex
    +import org.apache.hadoop.hbase.HConstants
    +import org.apache.hadoop.conf.Configuration
    +import java.io.IOException
    +
    +/**
    + * Internal helper class that saves an RDD using a HBase OutputFormat. This is only public
    + * because we need to access this class from the `spark` package to use some package-private HBase
    + * functions, but this class should not be used directly by users.
    + */
    +private[apache]
    +class SparkHBaseWriter(conf: HBaseConf) {
    +
    +  private var htable: HTable = null
    +
    +  val zkHost = conf.zkHost
    +  val zkPort = conf.zkPort
    +  val zkNode = conf.zkNode
    +  val table = conf.table
    +  val rowkeyType = conf.rowkeyType
    +  val columns = conf.columns
    +  val delimiter = conf.delimiter
    +
    +  def init() {
    +    val conf = new Configuration()
    +    conf.set(HConstants.ZOOKEEPER_QUORUM, zkHost)
    +    conf.set(HConstants.ZOOKEEPER_CLIENT_PORT, zkPort)
    +    conf.set(HConstants.ZOOKEEPER_ZNODE_PARENT, zkNode)
    +    htable = new HTable(conf, table)
    +    // Use default writebuffersize to submit batch puts
    +    htable.setAutoFlush(false)
    +  }
    +
    +  /**
    +   * Convert field to bytes
    +   * @param field split by delimiter from record
    +   * @param kind the type of field
    +   * @return
    +   */
    +  def toByteArr(field: String, kind: String) = kind match {
    +    case HBaseType.Boolean => Bytes.toBytes(field.toBoolean)
    +    case HBaseType.Short => Bytes.toBytes(field.toShort)
    +    case HBaseType.Int => Bytes.toBytes(field.toInt)
    +    case HBaseType.Long => Bytes.toBytes(field.toLong)
    +    case HBaseType.Float => Bytes.toBytes(field.toFloat)
    +    case HBaseType.Double => Bytes.toBytes(field.toDouble)
    +    case HBaseType.String => Bytes.toBytes(field)
    +    case HBaseType.Bytes => Hex.decodeHex(field.toCharArray)
    +    case _ => throw new IOException("Unsupported data type.")
    +  }
    +
    +  /**
    +   * Convert a string record to [[org.apache.hadoop.hbase.client.Put]]
    +   * @param record
    +   * @return
    +   */
    +  def parseRecord(record: String) = {
    +    val fields = record.split(delimiter)
    +    val put = new Put(toByteArr(fields(0), rowkeyType))
    +
    +    List.range(1, fields.size) foreach {
    +      i => put.add(columns(i - 1).family, columns(i - 1).qualifier,
    +        toByteArr(fields(i), columns(i - 1).typ))
    +    }
    +
    +    put
    --- End diff --
    
    @mridulm @tedyu is a PMC member of the Apache HBase project; maybe he has some better advice about this. I will update this if he has a better idea.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by BlackNiuza <gi...@git.apache.org>.
Github user BlackNiuza commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-39823084
  
    This PR looks good to me.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/194#discussion_r10862656
  
    --- Diff: external/hbase/src/main/scala/org/apache/spark/nosql/hbase/SparkHBaseWriter.scala ---
    @@ -0,0 +1,139 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.nosql.hbase
    +
    +import org.apache.hadoop.hbase.client.{Put, HTable}
    +import org.apache.hadoop.io.Text
    +import org.apache.hadoop.hbase.util.Bytes
    +import org.apache.commons.codec.binary.Hex
    +import org.apache.hadoop.hbase.HConstants
    +import org.apache.hadoop.conf.Configuration
    +import java.io.IOException
    +
    +/**
    + * Internal helper class that saves an RDD using a HBase OutputFormat. This is only public
    + * because we need to access this class from the `spark` package to use some package-private HBase
    + * functions, but this class should not be used directly by users.
    + */
    +private[apache]
    +class SparkHBaseWriter(conf: HBaseConf) {
    +
    +  private var htable: HTable = null
    +
    +  val zkHost = conf.zkHost
    +  val zkPort = conf.zkPort
    +  val zkNode = conf.zkNode
    +  val table = conf.table
    +  val rowkeyType = conf.rowkeyType
    +  val columns = conf.columns
    +  val delimiter = conf.delimiter
    +
    +  def init() {
    +    val conf = new Configuration()
    +    conf.set(HConstants.ZOOKEEPER_QUORUM, zkHost)
    +    conf.set(HConstants.ZOOKEEPER_CLIENT_PORT, zkPort)
    +    conf.set(HConstants.ZOOKEEPER_ZNODE_PARENT, zkNode)
    +    htable = new HTable(conf, table)
    +    // Use default writebuffersize to submit batch puts
    +    htable.setAutoFlush(false)
    +  }
    +
    +  /**
    +   * Convert field to bytes
    +   * @param field split by delimiter from record
    +   * @param kind the type of field
    +   * @return
    +   */
    +  def toByteArr(field: String, kind: String) = kind match {
    --- End diff --
    
    Or getBytes(), whichever sounds better: both are used in the Java API for different reasons.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-38392478
  
    I looked at this a bit more closely. I'm definitely +1 on having some utility functions like this for writing to HBase.
    
    It seems a bit brittle to me that we expect people to go through a textual representation of the records in order to save to HBase. I think a nicer way to do this would be to go through a SchemaRDD (a new feature recently merged into Spark), or even a Scala case class or Scala tuples, and then have an automatic conversion into HBase types based on the runtime type of the RDD. You'd just need to give a mapping of the attribute names to the HBase column names (/cc @marmbrust).
    
    This approach here seems a little bit more ad-hoc and like something we may not want to support for a long time going forward. So it might make sense to slot this for Spark 1.1 and re-work it to have more integrated support for schemas.
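
    A hypothetical sketch of the case-class style being described here (saveToHBase and its columnMapping parameter are illustrative names, not an existing Spark API):

        // Field names map to HBase columns; the fields' runtime types
        // would drive the conversion to HBase byte arrays.
        case class Fruit(rowkey: String, name: String, price: Float)

        val fruits = sc.parallelize(Seq(Fruit("0001", "apple", 1.0f)))
        fruits.saveToHBase(
          table = "test",
          columnMapping = Map("name" -> "cf:name", "price" -> "cf:price"))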




[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-40622251
  
    @pwendell Oh, sorry, I misunderstood what you said. The development of Spark is quite fast.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-50693874
  
    Hey @javadba, sorry I didn't see this earlier. I think for sure we want to add some HBase utility functions in Spark, but exactly what functionality is still unclear... I think we should be able to look at it on the 1.2 time frame but there were enough other issues in 1.1 that we weren't able to prioritize this.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by tedyu <gi...@git.apache.org>.
Github user tedyu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/194#discussion_r10855690
  
    --- Diff: external/hbase/src/main/scala/org/apache/spark/nosql/hbase/HBaseUtils.scala ---
    @@ -0,0 +1,47 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.nosql.hbase
    +
    +import org.apache.spark.storage.StorageLevel
    +import org.apache.hadoop.io.Text
    +import org.apache.spark.TaskContext
    +import org.apache.spark.rdd.RDD
    +
    +object HBaseUtils {
    --- End diff --
    
    Add class level document, please.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-39661718
  
    I will update the code later.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by tedyu <gi...@git.apache.org>.
Github user tedyu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/194#discussion_r10863200
  
    --- Diff: external/hbase/src/main/scala/org/apache/spark/nosql/hbase/SparkHBaseWriter.scala ---
    @@ -0,0 +1,152 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.nosql.hbase
    +
    +import org.apache.hadoop.hbase.client.{Put, HTable}
    +import org.apache.hadoop.io.Text
    +import org.apache.hadoop.hbase.util.Bytes
    +import org.apache.commons.codec.binary.Hex
    +import org.apache.hadoop.hbase.HConstants
    +import org.apache.hadoop.conf.Configuration
    +import java.io.IOException
    +import org.apache.spark.Logging
    +
    +/**
    + * Internal helper class that saves an RDD using a HBase OutputFormat. This is only public
    + * because we need to access this class from the `hbase` package to use some package-private HBase
    + * functions, but this class should not be used directly by users.
    + */
    +private[hbase]
    +class SparkHBaseWriter(conf: HBaseConf)
    +  extends Logging {
    +
    +  private var htable: HTable = null
    +
    +  val zkHost = conf.zkHost
    +  val zkPort = conf.zkPort
    +  val zkNode = conf.zkNode
    +  val table = conf.table
    +  val rowkeyType = conf.rowkeyType
    +  val columns = conf.columns
    +  val delimiter = conf.delimiter
    +
    +  def init() {
    +    val conf = new Configuration()
    +    conf.set(HConstants.ZOOKEEPER_QUORUM, zkHost)
    +    conf.set(HConstants.ZOOKEEPER_CLIENT_PORT, zkPort)
    +    conf.set(HConstants.ZOOKEEPER_ZNODE_PARENT, zkNode)
    +    htable = new HTable(conf, table)
    --- End diff --
    
    See section 9.3.1.1 under http://hbase.apache.org/book.html#client.connections
    
    This can be done in a follow-on JIRA.
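
    Something along these lines would share the underlying connection (a rough sketch against the 0.94 client API; the pool size and table name are placeholders):
    
    ```scala
    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{HTableInterface, HTablePool}
    
    // Reuse tables from a shared pool rather than constructing a fresh HTable
    // (and hence a fresh connection) for every writer.
    val pool = new HTablePool(HBaseConfiguration.create(), 10)
    
    val htable: HTableInterface = pool.getTable("my_table")
    try {
      // ... issue puts against htable ...
    } finally {
      htable.close()  // in 0.94 this returns the table to the pool
    }
    ```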



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on a diff in the pull request:

    https://github.com/apache/spark/pull/194#discussion_r10862631
  
    --- Diff: external/hbase/src/main/scala/org/apache/spark/nosql/hbase/SparkHBaseWriter.scala ---
    @@ -0,0 +1,139 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.nosql.hbase
    +
    +import org.apache.hadoop.hbase.client.{Put, HTable}
    +import org.apache.hadoop.io.Text
    +import org.apache.hadoop.hbase.util.Bytes
    +import org.apache.commons.codec.binary.Hex
    +import org.apache.hadoop.hbase.HConstants
    +import org.apache.hadoop.conf.Configuration
    +import java.io.IOException
    +
    +/**
    + * Internal helper class that saves an RDD using a HBase OutputFormat. This is only public
    + * because we need to access this class from the `spark` package to use some package-private HBase
    + * functions, but this class should not be used directly by users.
    + */
    +private[apache]
    +class SparkHBaseWriter(conf: HBaseConf) {
    +
    +  private var htable: HTable = null
    +
    +  val zkHost = conf.zkHost
    +  val zkPort = conf.zkPort
    +  val zkNode = conf.zkNode
    +  val table = conf.table
    +  val rowkeyType = conf.rowkeyType
    +  val columns = conf.columns
    +  val delimiter = conf.delimiter
    +
    +  def init() {
    +    val conf = new Configuration()
    +    conf.set(HConstants.ZOOKEEPER_QUORUM, zkHost)
    +    conf.set(HConstants.ZOOKEEPER_CLIENT_PORT, zkPort)
    +    conf.set(HConstants.ZOOKEEPER_ZNODE_PARENT, zkNode)
    +    htable = new HTable(conf, table)
    +    // Use default writebuffersize to submit batch puts
    +    htable.setAutoFlush(false)
    +  }
    +
    +  /**
    +   * Convert field to bytes
    +   * @param field split by delimiter from record
    +   * @param kind the type of field
    +   * @return
    +   */
    +  def toByteArr(field: String, kind: String) = kind match {
    +    case HBaseType.Boolean => Bytes.toBytes(field.toBoolean)
    +    case HBaseType.Short => Bytes.toBytes(field.toShort)
    +    case HBaseType.Int => Bytes.toBytes(field.toInt)
    +    case HBaseType.Long => Bytes.toBytes(field.toLong)
    +    case HBaseType.Float => Bytes.toBytes(field.toFloat)
    +    case HBaseType.Double => Bytes.toBytes(field.toDouble)
    +    case HBaseType.String => Bytes.toBytes(field)
    +    case HBaseType.Bytes => Hex.decodeHex(field.toCharArray)
    +    case _ => throw new IOException("Unsupported data type.")
    +  }
    +
    +  /**
    +   * Convert a string record to [[org.apache.hadoop.hbase.client.Put]]
    +   * @param record
    +   * @return
    +   */
    +  def parseRecord(record: String) = {
    +    val fields = record.split(delimiter)
    +    val put = new Put(toByteArr(fields(0), rowkeyType))
    +
    +    List.range(1, fields.size) foreach {
    +      i => put.add(columns(i - 1).family, columns(i - 1).qualifier,
    +        toByteArr(fields(i), columns(i - 1).typ))
    +    }
    +
    +    put
    --- End diff --
    
    > Is the assumption here that fields(0) is the row key and the rest are column values?
    
    Yes. Should I document this assumption in the Scaladoc, or is there a better convention to use?



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent closed the pull request at:

    https://github.com/apache/spark/pull/194





[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-39824171
  
    OK, I see. Thank you for your reply. :-) @pwendell 



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/194#discussion_r10862600
  
    --- Diff: external/hbase/src/main/scala/org/apache/spark/nosql/hbase/SparkHBaseWriter.scala ---
    @@ -0,0 +1,139 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.nosql.hbase
    +
    +import org.apache.hadoop.hbase.client.{Put, HTable}
    +import org.apache.hadoop.io.Text
    +import org.apache.hadoop.hbase.util.Bytes
    +import org.apache.commons.codec.binary.Hex
    +import org.apache.hadoop.hbase.HConstants
    +import org.apache.hadoop.conf.Configuration
    +import java.io.IOException
    +
    +/**
    + * Internal helper class that saves an RDD using a HBase OutputFormat. This is only public
    + * because we need to access this class from the `spark` package to use some package-private HBase
    + * functions, but this class should not be used directly by users.
    + */
    +private[apache]
    +class SparkHBaseWriter(conf: HBaseConf) {
    +
    +  private var htable: HTable = null
    +
    +  val zkHost = conf.zkHost
    +  val zkPort = conf.zkPort
    +  val zkNode = conf.zkNode
    +  val table = conf.table
    +  val rowkeyType = conf.rowkeyType
    +  val columns = conf.columns
    +  val delimiter = conf.delimiter
    +
    +  def init() {
    +    val conf = new Configuration()
    +    conf.set(HConstants.ZOOKEEPER_QUORUM, zkHost)
    +    conf.set(HConstants.ZOOKEEPER_CLIENT_PORT, zkPort)
    +    conf.set(HConstants.ZOOKEEPER_ZNODE_PARENT, zkNode)
    +    htable = new HTable(conf, table)
    +    // Use default writebuffersize to submit batch puts
    +    htable.setAutoFlush(false)
    +  }
    +
    +  /**
    +   * Convert field to bytes
    +   * @param field split by delimiter from record
    +   * @param kind the type of field
    +   * @return
    +   */
    +  def toByteArr(field: String, kind: String) = kind match {
    +    case HBaseType.Boolean => Bytes.toBytes(field.toBoolean)
    +    case HBaseType.Short => Bytes.toBytes(field.toShort)
    +    case HBaseType.Int => Bytes.toBytes(field.toInt)
    +    case HBaseType.Long => Bytes.toBytes(field.toLong)
    +    case HBaseType.Float => Bytes.toBytes(field.toFloat)
    +    case HBaseType.Double => Bytes.toBytes(field.toDouble)
    +    case HBaseType.String => Bytes.toBytes(field)
    +    case HBaseType.Bytes => Hex.decodeHex(field.toCharArray)
    +    case _ => throw new IOException("Unsupported data type.")
    +  }
    +
    +  /**
    +   * Convert a string record to [[org.apache.hadoop.hbase.client.Put]]
    +   * @param record
    +   * @return
    +   */
    +  def parseRecord(record: String) = {
    +    val fields = record.split(delimiter)
    +    val put = new Put(toByteArr(fields(0), rowkeyType))
    +
    +    List.range(1, fields.size) foreach {
    +      i => put.add(columns(i - 1).family, columns(i - 1).qualifier,
    +        toByteArr(fields(i), columns(i - 1).typ))
    +    }
    +
    +    put
    --- End diff --
    
    Is the assumption here that fields(0) is the row key and the rest are column values?
    Is that a generic assumption we can rely on? I did not see this constraint mentioned anywhere (maybe I missed it?)



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/194#discussion_r10862580
  
    --- Diff: external/hbase/src/main/scala/org/apache/spark/nosql/hbase/HBaseUtils.scala ---
    @@ -0,0 +1,60 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.nosql.hbase
    +
    +import org.apache.hadoop.io.Text
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * A public object that provide HBase supports.
    + * You could save RDD into HBase through [[org.apache.spark.nosql.hbase.HBaseUtils.saveAsHBaseTable]] method.
    + */
    +object HBaseUtils {
    +
    +  /**
    +   * Save [[org.apache.spark.rdd.RDD[Text]]] as a HBase table
    +   * @param rdd [[org.apache.spark.rdd.RDD[Text]]]
    +   * @param zkHost the zookeeper hosts. e.g. "10.232.98.10,10.232.98.11,10.232.98.12"
    +   * @param zkPort the zookeeper client listening port. e.g. "2181"
    +   * @param zkNode the zookeeper znode of HBase. e.g. "hbase-apache"
    +   * @param table the name of table which we save records
    +   * @param rowkeyType the type of rowkey. [[org.apache.spark.nosql.hbase.HBaseType]]
    +   * @param columns the column list. [[org.apache.spark.nosql.hbase.HBaseColumn]]
    +   * @param delimiter the delimiter which used to split record into fields
    +   */
    +  def saveAsHBaseTable(rdd: RDD[Text], zkHost: String, zkPort: String, zkNode: String,
    +                       table: String, rowkeyType: String, columns: List[HBaseColumn], delimiter: Char) {
    +    val conf = new HBaseConf(zkHost, zkPort, zkNode, table, rowkeyType, columns, delimiter)
    +
    +    def writeToHBase(iter: Iterator[Text]) {
    +      val writer = new SparkHBaseWriter(conf)
    +
    +      try {
    +        writer.init()
    +        while (iter.hasNext) {
    +          val record = iter.next()
    +          writer.write(record)
    +        }
    +      } finally {
    +        writer.close()
    +      }
    --- End diff --
    
    If init fails, how does close behave? Is it a no-op?
    Also, since close can throw exceptions of its own, it might be a good idea to wrap it in try/catch, so that if the finally block is reached via an exception in the try block, the user still sees the actual reason for the failure.
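
    For example, something like this (just a sketch; it assumes a logWarning helper is available in scope):
    
    ```scala
    def writeToHBase(iter: Iterator[Text]) {
      val writer = new SparkHBaseWriter(conf)
      try {
        writer.init()
        iter.foreach(writer.write)
      } finally {
        try {
          writer.close()
        } catch {
          // Log rather than rethrow, so a failure in close() cannot mask
          // the original exception from init() or write().
          case e: Exception => logWarning("Failed to close HBase writer", e)
        }
      }
    }
    ```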



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-39661707
  
    @marmbrus @pwendell I restarted work on this issue today. After studying the sources related to `SchemaRDD`, I think the better approach is to provide both `saveAsHBaseTable(rdd: RDD[Text], ...)` and `saveAsHBaseTable(rdd: SchemaRDD, ...)`. HBase is quite different from an RDBMS. `SchemaRDD` assumes every cell in a `Row` has a `name` and a `dataType`. That assumption is fine for Hive or Parquet, but for HBase it misses some important parts. In HBase, all data is stored as `Array[Byte]` with no `dataType`, and every cell has a rowkey (like an index in an RDBMS), a qualifier (like the `name` above) and a column family. The column family cannot be represented in a `SchemaRDD`.
    
    So for users with specific column-family requirements, we can provide `saveAsHBaseTable(rdd: RDD[Text], ...)` and document how to use it; that gives users maximum flexibility with HBase. On the other hand, `saveAsHBaseTable(rdd: SchemaRDD, ...)` is also needed for users who have only one column family; we can set a fixed column family when initializing `SparkHBaseWriter` to work around the problem above.
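
    A rough sketch of what the `SchemaRDD` variant could look like with a single fixed column family (the schema accessors are illustrative, and every value is written as a string for simplicity):
    
    ```scala
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.sql.SchemaRDD
    
    def saveAsHBaseTable(rdd: SchemaRDD, table: String, family: Array[Byte]) {
      val qualifiers = rdd.schema.fields.map(f => Bytes.toBytes(f.name))
      rdd.foreachPartition { rows =>
        // open an HTable for `table` here (setup and close omitted)
        rows.foreach { row =>
          // convention from this patch: the first field is the rowkey
          val put = new Put(Bytes.toBytes(row(0).toString))
          for (i <- 1 until qualifiers.length) {
            put.add(family, qualifiers(i), Bytes.toBytes(row(i).toString))
          }
          // htable.put(put)
        }
      }
    }
    ```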



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by tedyu <gi...@git.apache.org>.
Github user tedyu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/194#discussion_r10862801
  
    --- Diff: external/hbase/src/main/scala/org/apache/spark/nosql/hbase/HBaseUtils.scala ---
    @@ -0,0 +1,60 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.nosql.hbase
    +
    +import org.apache.hadoop.io.Text
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * A public object that provide HBase supports.
    + * You could save RDD into HBase through [[org.apache.spark.nosql.hbase.HBaseUtils.saveAsHBaseTable]] method.
    + */
    +object HBaseUtils {
    +
    +  /**
    +   * Save [[org.apache.spark.rdd.RDD[Text]]] as a HBase table
    +   * @param rdd [[org.apache.spark.rdd.RDD[Text]]]
    +   * @param zkHost the zookeeper hosts. e.g. "10.232.98.10,10.232.98.11,10.232.98.12"
    +   * @param zkPort the zookeeper client listening port. e.g. "2181"
    +   * @param zkNode the zookeeper znode of HBase. e.g. "hbase-apache"
    +   * @param table the name of table which we save records
    +   * @param rowkeyType the type of rowkey. [[org.apache.spark.nosql.hbase.HBaseType]]
    +   * @param columns the column list. [[org.apache.spark.nosql.hbase.HBaseColumn]]
    +   * @param delimiter the delimiter which used to split record into fields
    +   */
    +  def saveAsHBaseTable(rdd: RDD[Text], zkHost: String, zkPort: String, zkNode: String,
    +                       table: String, rowkeyType: String, columns: List[HBaseColumn], delimiter: Char) {
    +    val conf = new HBaseConf(zkHost, zkPort, zkNode, table, rowkeyType, columns, delimiter)
    +
    +    def writeToHBase(iter: Iterator[Text]) {
    +      val writer = new SparkHBaseWriter(conf)
    +
    +      try {
    +        writer.init()
    +        while (iter.hasNext) {
    +          val record = iter.next()
    +          writer.write(record)
    +        }
    +      } finally {
    +        writer.close()
    +      }
    --- End diff --
    
    See my comment on the close() method below.
    If init() throws an exception, htable will be null. With an additional null check, writer.close() should be a no-op.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-38322773
  
    Jenkins, test this please



[GitHub] spark pull request: Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-38258966
  
    @pwendell @tedyu Could you help me review this? Thank you in advance. :-)



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-54694754
  
    Can one of the admins verify this patch?





[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on a diff in the pull request:

    https://github.com/apache/spark/pull/194#discussion_r10863265
  
    --- Diff: external/hbase/src/main/scala/org/apache/spark/nosql/hbase/HBaseUtils.scala ---
    @@ -0,0 +1,68 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.nosql.hbase
    +
    +import org.apache.hadoop.io.Text
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * A public object that provide HBase supports.
    + * You could save RDD into HBase through [[org.apache.spark.nosql.hbase.HBaseUtils.saveAsHBaseTable]] method.
    + */
    +object HBaseUtils {
    +
    +  /**
    +   * Save [[org.apache.spark.rdd.RDD[Text]]] as a HBase table
    +   *
    +   * The format of record in RDD should looks like this:
    +   *   rowkey|delimiter|column|delimiter|column|delimiter|...
    +   * For example (if delimiter is ","):
    +   *   0001,apple,banana
    +   * "0001" is rowkey field while "apple" and "banana" are column fields.
    +   *
    +   * @param rdd [[org.apache.spark.rdd.RDD[Text]]]
    +   * @param zkHost the zookeeper hosts. e.g. "10.232.98.10,10.232.98.11,10.232.98.12"
    +   * @param zkPort the zookeeper client listening port. e.g. "2181"
    +   * @param zkNode the zookeeper znode of HBase. e.g. "hbase-apache"
    +   * @param table the name of table which we save records
    +   * @param rowkeyType the type of rowkey. [[org.apache.spark.nosql.hbase.HBaseType]]
    +   * @param columns the column list. [[org.apache.spark.nosql.hbase.HBaseColumn]]
    +   * @param delimiter the delimiter which used to split record into fields
    +   */
    +  def saveAsHBaseTable(rdd: RDD[Text], zkHost: String, zkPort: String, zkNode: String,
    +                       table: String, rowkeyType: String, columns: List[HBaseColumn], delimiter: Char) {
    +    val conf = new HBaseConf(zkHost, zkPort, zkNode, table, rowkeyType, columns, delimiter)
    +
    +    def writeToHBase(iter: Iterator[Text]) {
    +      val writer = new SparkHBaseWriter(conf)
    +
    +      try {
    +        writer.init()
    +
    +        while (iter.hasNext) {
    +          val record = iter.next()
    +          writer.write(record)
    +        }
    +      } finally {
    +        writer.close()
    --- End diff --
    
    Yep, moving it out would be clearer. I have already updated this.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by javadba <gi...@git.apache.org>.
Github user javadba commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-49133737
  
    Hi, the referenced PR SPARK-1416 includes the following comment by @MLnick:
    
    "But looking at the HBase PR you referenced, I don't see the value of having that live in Spark. And why is it not simply using an OutputFormat instead of custom config and writing code? (I might be missing something here, but it seems to add complexity and maintenance burden unnecessarily)"
    
    Patrick: would you mind telling us whether that comment will affect this PR? We are going to be providing a significant chunk of HBase functionality and would like to know whether to build on this PR or not. Thanks.
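
    For context, the plain OutputFormat route that comment refers to already works with Spark's existing Hadoop support, roughly like this (table, family and qualifier names are placeholders):
    
    ```scala
    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapred.TableOutputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapred.JobConf
    import org.apache.spark.SparkContext._
    
    val jobConf = new JobConf(HBaseConfiguration.create())
    jobConf.setOutputFormat(classOf[TableOutputFormat])
    jobConf.set(TableOutputFormat.OUTPUT_TABLE, "my_table")
    
    // rdd: RDD[String] with records like "0001,apple"
    rdd.map { line =>
      val fields = line.split(",")
      val put = new Put(Bytes.toBytes(fields(0)))  // fields(0) is the rowkey
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(fields(1)))
      (new ImmutableBytesWritable, put)
    }.saveAsHadoopDataset(jobConf)
    ```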



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-38503577
  
    Hi @haosdent,
    
    As Patrick said, I think using a SchemaRDD as the input would simplify this quite a bit and avoid the need to define your own custom column mapping. SchemaRDDs already have named columns and datatypes. Let me know if you have any questions about how you could use this functionality.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on a diff in the pull request:

    https://github.com/apache/spark/pull/194#discussion_r10862614
  
    --- Diff: external/hbase/src/main/scala/org/apache/spark/nosql/hbase/HBaseUtils.scala ---
    @@ -0,0 +1,60 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.nosql.hbase
    +
    +import org.apache.hadoop.io.Text
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * A public object that provide HBase supports.
    + * You could save RDD into HBase through [[org.apache.spark.nosql.hbase.HBaseUtils.saveAsHBaseTable]] method.
    + */
    +object HBaseUtils {
    +
    +  /**
    +   * Save [[org.apache.spark.rdd.RDD[Text]]] as a HBase table
    +   * @param rdd [[org.apache.spark.rdd.RDD[Text]]]
    +   * @param zkHost the zookeeper hosts. e.g. "10.232.98.10,10.232.98.11,10.232.98.12"
    +   * @param zkPort the zookeeper client listening port. e.g. "2181"
    +   * @param zkNode the zookeeper znode of HBase. e.g. "hbase-apache"
    +   * @param table the name of table which we save records
    +   * @param rowkeyType the type of rowkey. [[org.apache.spark.nosql.hbase.HBaseType]]
    +   * @param columns the column list. [[org.apache.spark.nosql.hbase.HBaseColumn]]
    +   * @param delimiter the delimiter which used to split record into fields
    +   */
    +  def saveAsHBaseTable(rdd: RDD[Text], zkHost: String, zkPort: String, zkNode: String,
    +                       table: String, rowkeyType: String, columns: List[HBaseColumn], delimiter: Char) {
    +    val conf = new HBaseConf(zkHost, zkPort, zkNode, table, rowkeyType, columns, delimiter)
    +
    +    def writeToHBase(iter: Iterator[Text]) {
    +      val writer = new SparkHBaseWriter(conf)
    +
    +      try {
    +        writer.init()
    +        while (iter.hasNext) {
    +          val record = iter.next()
    +          writer.write(record)
    +        }
    +      } finally {
    +        writer.close()
    +      }
    --- End diff --
    
    @mridulm Thanks for your great help! I am not a native English speaker. Do you mean the code should look like this?
    
    ```scala
        def writeToHBase(iter: Iterator[Text]) {
          val writer = new SparkHBaseWriter(conf)
    
          writer.init()
          while (iter.hasNext) {
            val record = iter.next()
            writer.write(record)
          }
          writer.close()
        }
    ```



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-44720047
  
    ping @pwendell, 1.0 has been released~



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/194#discussion_r10868441
  
    --- Diff: external/hbase/pom.xml ---
    @@ -0,0 +1,92 @@
    +<?xml version="1.0" encoding="UTF-8"?>
    +<!--
    +  ~ Licensed to the Apache Software Foundation (ASF) under one or more
    +  ~ contributor license agreements.  See the NOTICE file distributed with
    +  ~ this work for additional information regarding copyright ownership.
    +  ~ The ASF licenses this file to You under the Apache License, Version 2.0
    +  ~ (the "License"); you may not use this file except in compliance with
    +  ~ the License.  You may obtain a copy of the License at
    +  ~
    +  ~    http://www.apache.org/licenses/LICENSE-2.0
    +  ~
    +  ~ Unless required by applicable law or agreed to in writing, software
    +  ~ distributed under the License is distributed on an "AS IS" BASIS,
    +  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  ~ See the License for the specific language governing permissions and
    +  ~ limitations under the License.
    +  -->
    +
    +<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    +         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    +    <modelVersion>4.0.0</modelVersion>
    +    <parent>
    +        <groupId>org.apache.spark</groupId>
    +        <artifactId>spark-parent</artifactId>
    +        <version>1.0.0-SNAPSHOT</version>
    +        <relativePath>../../pom.xml</relativePath>
    +    </parent>
    +
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-nosql-hbase</artifactId>
    +    <packaging>jar</packaging>
    +    <name>Spark Project External HBase</name>
    +    <url>http://spark.apache.org/</url>
    +
    +    <dependencies>
    +        <dependency>
    +            <groupId>org.apache.spark</groupId>
    +            <artifactId>spark-core_${scala.binary.version}</artifactId>
    +            <version>${project.version}</version>
    +        </dependency>
    +        <dependency>
    +            <groupId>org.apache.spark</groupId>
    +            <artifactId>spark-core_${scala.binary.version}</artifactId>
    +            <version>${project.version}</version>
    +            <type>test-jar</type>
    +            <scope>test</scope>
    +        </dependency>
    +        <dependency>
    +            <groupId>org.apache.hbase</groupId>
    +            <artifactId>hbase</artifactId>
    +            <version>${hbase.version}</version>
    +        </dependency>
    +        <dependency>
    +            <groupId>org.apache.hbase</groupId>
    +            <artifactId>hbase</artifactId>
    +            <version>${hbase.version}</version>
    +            <type>test-jar</type>
    +            <scope>test</scope>
    +        </dependency>
    +        <dependency>
    --- End diff --
    
    Don't these come automatically from spark-core? Is there a reason you need to declare them here?



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-38391364
  
    @haosdent could you add this to the SBT build as well? Look at how the other external projects work (e.g. twitter, mqtt, etc.)



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-38404602
  
    > I think a nicer way to do this would be to go through a SchemaRDD (which is a new feature recently merged into Spark) or even a Scala case class or Scala tuples.
    
    @pwendell Thanks for your advice. Let me update it.

[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-38382577
  
    @mridulm I think the code is in good shape now. Do you have any advice? And how can I get this pull request merged? Thank you very much.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on a diff in the pull request:

    https://github.com/apache/spark/pull/194#discussion_r10863248
  
    --- Diff: external/hbase/src/main/scala/org/apache/spark/nosql/hbase/SparkHBaseWriter.scala ---
    @@ -0,0 +1,152 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.nosql.hbase
    +
    +import org.apache.hadoop.hbase.client.{Put, HTable}
    +import org.apache.hadoop.io.Text
    +import org.apache.hadoop.hbase.util.Bytes
    +import org.apache.commons.codec.binary.Hex
    +import org.apache.hadoop.hbase.HConstants
    +import org.apache.hadoop.conf.Configuration
    +import java.io.IOException
    +import org.apache.spark.Logging
    +
    +/**
    + * Internal helper class that saves an RDD using a HBase OutputFormat. This is only public
    + * because we need to access this class from the `hbase` package to use some package-private HBase
    + * functions, but this class should not be used directly by users.
    + */
    +private[hbase]
    +class SparkHBaseWriter(conf: HBaseConf)
    +  extends Logging {
    +
    +  private var htable: HTable = null
    +
    +  val zkHost = conf.zkHost
    +  val zkPort = conf.zkPort
    +  val zkNode = conf.zkNode
    +  val table = conf.table
    +  val rowkeyType = conf.rowkeyType
    +  val columns = conf.columns
    +  val delimiter = conf.delimiter
    +
    +  def init() {
    +    val conf = new Configuration()
    +    conf.set(HConstants.ZOOKEEPER_QUORUM, zkHost)
    +    conf.set(HConstants.ZOOKEEPER_CLIENT_PORT, zkPort)
    +    conf.set(HConstants.ZOOKEEPER_ZNODE_PARENT, zkNode)
    +    htable = new HTable(conf, table)
    --- End diff --
    
    The HBase dependency in Spark is 0.94.6, so I use new HTable() here.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-38357573
  
    I changed the writer's close method to:
    
    ```scala
    def close() {
        Option(htable) match {
          case Some(t) => {
            try {
              t.close()
            } catch {
              case ex: Exception => logWarning("Close HBase table failed.", ex)
            }
          }
          case None => logWarning("HBase table variable is null!")
        }
      }
    ```
    
    And added the assumption to the Scaladoc:
    
    ```
    /**
       * Save [[org.apache.spark.rdd.RDD[Text]]] as a HBase table
       *
       * The format of record in RDD should looks like this:
       *   rowkey|delimiter|column|delimiter|column|delimiter|...
       * For example (if delimiter is ","):
       *   0001,apple,banana
       * "0001" is rowkey field while "apple" and "banana" are column fields.
    ```
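
    So a call matching that example would look roughly like this (assuming an HBaseColumn built from family, qualifier and type, as in this patch):
    
    ```scala
    // rdd: RDD[Text] with records like "0001,apple,banana"
    val cf = Bytes.toBytes("fruits")
    val columns = List(
      new HBaseColumn(cf, Bytes.toBytes("f1"), HBaseType.String),
      new HBaseColumn(cf, Bytes.toBytes("f2"), HBaseType.String))
    
    HBaseUtils.saveAsHBaseTable(rdd, "zk1,zk2,zk3", "2181", "/hbase",
      "fruit_table", HBaseType.String, columns, ',')
    ```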



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-38324070
  
     Merged build triggered.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by tedyu <gi...@git.apache.org>.
Github user tedyu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/194#discussion_r10855766
  
    --- Diff: external/hbase/src/main/scala/org/apache/spark/nosql/hbase/HBaseUtils.scala ---
    @@ -0,0 +1,47 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.nosql.hbase
    +
    +import org.apache.spark.storage.StorageLevel
    +import org.apache.hadoop.io.Text
    +import org.apache.spark.TaskContext
    +import org.apache.spark.rdd.RDD
    +
    +object HBaseUtils {
    +  /**
    +   * Save RDD to HBase
    +   */
    +  def saveAsHBaseTable(rdd: RDD[Text], zkHost: String, zkPort: String, zkNode: String,
    +                       table: String, rowkeyType: String, columns: List[HBaseColumn], delimiter: Char) {
    +    val conf = new HBaseConf(zkHost, zkPort, zkNode, table, rowkeyType, columns, delimiter)
    +
    +    def writeToHBase(iter: Iterator[Text]) {
    +      val writer = new SparkHBaseWriter(conf)
    +      writer.init()
    +
    +      while (iter.hasNext) {
    +        val record = iter.next()
    +        writer.write(record)
    +      }
    +
    +      writer.close()
    --- End diff --
    
    Should this be enclosed in a finally block?



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by haosdent <gi...@git.apache.org>.
Github user haosdent commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-40618129
  
    @pwendell Sorry to disturb you again. I saw that 0.9.1 has been released; could you help me review this pull request again? Thank you very much.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-40620786
  
    @haosdent the release I'm talking about is 1.0, which is currently being worked on.



[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/194#discussion_r10862585
  
    --- Diff: external/hbase/src/main/scala/org/apache/spark/nosql/hbase/SparkHBaseWriter.scala ---
    @@ -0,0 +1,139 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.nosql.hbase
    +
    +import org.apache.hadoop.hbase.client.{Put, HTable}
    +import org.apache.hadoop.io.Text
    +import org.apache.hadoop.hbase.util.Bytes
    +import org.apache.commons.codec.binary.Hex
    +import org.apache.hadoop.hbase.HConstants
    +import org.apache.hadoop.conf.Configuration
    +import java.io.IOException
    +
    +/**
    + * Internal helper class that saves an RDD using a HBase OutputFormat. This is only public
    + * because we need to access this class from the `spark` package to use some package-private HBase
    + * functions, but this class should not be used directly by users.
    + */
    +private[apache]
    +class SparkHBaseWriter(conf: HBaseConf) {
    +
    +  private var htable: HTable = null
    +
    +  val zkHost = conf.zkHost
    +  val zkPort = conf.zkPort
    +  val zkNode = conf.zkNode
    +  val table = conf.table
    +  val rowkeyType = conf.rowkeyType
    +  val columns = conf.columns
    +  val delimiter = conf.delimiter
    +
    +  def init() {
    +    val conf = new Configuration()
    +    conf.set(HConstants.ZOOKEEPER_QUORUM, zkHost)
    +    conf.set(HConstants.ZOOKEEPER_CLIENT_PORT, zkPort)
    +    conf.set(HConstants.ZOOKEEPER_ZNODE_PARENT, zkNode)
    +    htable = new HTable(conf, table)
    +    // Use default writebuffersize to submit batch puts
    +    htable.setAutoFlush(false)
    +  }
    +
    +  /**
    +   * Convert field to bytes
    +   * @param field split by delimiter from record
    +   * @param kind the type of field
    +   * @return
    +   */
    +  def toByteArr(field: String, kind: String) = kind match {
    --- End diff --
    
    rename as toByteArray

