
Posted to issues@hbase.apache.org by "Balazs Meszaros (JIRA)" <ji...@apache.org> on 2019/07/18 12:07:00 UTC

[jira] [Created] (HBASE-22711) Spark connector doesn't use the given mapping when inserting data

Balazs Meszaros created HBASE-22711:
---------------------------------------

             Summary: Spark connector doesn't use the given mapping when inserting data
                 Key: HBASE-22711
                 URL: https://issues.apache.org/jira/browse/HBASE-22711
             Project: HBase
          Issue Type: Bug
          Components: hbase-connectors
    Affects Versions: connector-1.0.0
            Reporter: Balazs Meszaros
            Assignee: Balazs Meszaros


In some cases a Spark DataFrame cannot be read back with the same mapping it was written with. For example:

{code:scala}
import spark.implicits._ // provides .toDS; already in scope in spark-shell

val sql = spark.sqlContext

val persons =
    """[
      |{"name": "alice", "age": 20, "height": 5, "email": "alice@alice.com"},
      |{"name": "bob", "age": 23, "height": 6, "email": "bob@bob.com"},
      |{"name": "carol", "age": 12, "email": "carol@carol.com", "height": 4.11}
      |]
    """.stripMargin

val df = spark.read.json(Seq(persons).toDS)

df.write
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.columns.mapping", "name STRING :key, age SHORT p:age, email STRING c:email, height FLOAT p:height")
  .option("hbase.table", "person")
  .option("hbase.spark.use.hbasecontext", false)
  .save()
{code}

It cannot be read back with the same mapping:

{code:scala}
val df2 = sql.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.columns.mapping", "name STRING :key, age SHORT p:age, email STRING c:email, height FLOAT p:height")
  .option("hbase.table", "person")
  .option("hbase.spark.use.hbasecontext", false)
  .load()

df2.createOrReplaceTempView("tableView")

val results = sql.sql("SELECT * FROM tableView")
results.show()
{code}

The results:

{noformat}
+---+-----+---------+---------------+
|age| name|   height|          email|
+---+-----+---------+---------------+
|  0|alice|   2.3125|alice@alice.com|
|  0|  bob|    2.375|    bob@bob.com|
|  0|carol|2.2568748|carol@carol.com|
+---+-----+---------+---------------+
{noformat}

Spark stores integer values as long and floating-point values as double, and the connector writes them with those types instead of the ones given in the mapping. As a result, the SHORT and FLOAT columns are both 8 bytes long in HBase:

{noformat}
shell> scan 'person'
 alice                column=p:age, timestamp=1563450714829, value=\x00\x00\x00\x00\x00\x00\x00\x14
 alice                column=p:height, timestamp=1563450714829, value=@\x14\x00\x00\x00\x00\x00\x00
{noformat}
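
The wrong values in the results are consistent with a reader that decodes only the leading 2 bytes (SHORT) or 4 bytes (FLOAT) of each 8-byte cell. The sketch below reproduces that arithmetic with plain java.nio, assuming big-endian encoding as seen in the scan output. The object and helper names are hypothetical and this is not the connector's code:

```scala
import java.nio.ByteBuffer

// Illustration only: hypothetical helpers, not part of hbase-connectors.
object WidthMismatchDemo {
  // Encode as big-endian 8-byte values, matching the layout in the scan output.
  def longBytes(v: Long): Array[Byte] = ByteBuffer.allocate(8).putLong(v).array()
  def doubleBytes(v: Double): Array[Byte] = ByteBuffer.allocate(8).putDouble(v).array()

  // Decode only the leading bytes, as a reader expecting the narrower type would.
  def leadingShort(b: Array[Byte]): Short = ByteBuffer.wrap(b).getShort()
  def leadingFloat(b: Array[Byte]): Float = ByteBuffer.wrap(b).getFloat()

  def main(args: Array[String]): Unit = {
    println(leadingShort(longBytes(20L)))   // alice's age: prints 0
    println(leadingFloat(doubleBytes(5.0))) // alice's height: prints 2.3125
    println(leadingFloat(doubleBytes(6.0))) // bob's height: prints 2.375
  }
}
```

The printed values match the age and height columns shown for alice and bob in the results above.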


