Posted to dev@hbase.apache.org by "Balazs Meszaros (JIRA)" <ji...@apache.org> on 2019/07/22 14:33:00 UTC

[jira] [Resolved] (HBASE-22711) Spark connector doesn't use the given mapping when inserting data

     [ https://issues.apache.org/jira/browse/HBASE-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Balazs Meszaros resolved HBASE-22711.
-------------------------------------
       Resolution: Fixed
    Fix Version/s: connector-1.0.1

> Spark connector doesn't use the given mapping when inserting data
> -----------------------------------------------------------------
>
>                 Key: HBASE-22711
>                 URL: https://issues.apache.org/jira/browse/HBASE-22711
>             Project: HBase
>          Issue Type: Bug
>          Components: hbase-connectors
>    Affects Versions: connector-1.0.0
>            Reporter: Balazs Meszaros
>            Assignee: Balazs Meszaros
>            Priority: Major
>             Fix For: connector-1.0.1
>
>
> In some cases a Spark DataFrame cannot be read back with the same column mapping it was written with. For example:
> {code:scala}
> val sql = spark.sqlContext
> val persons =
>     """[
>       |{"name": "alice", "age": 20, "height": 5, "email": "alice@alice.com"},
>       |{"name": "bob", "age": 23, "height": 6, "email": "bob@bob.com"},
>       |{"name": "carol", "age": 12, "email": "carol@carol.com", "height": 4.11}
>       |]
>     """.stripMargin
> val df = spark.read.json(Seq(persons).toDS)
> df.write
>   .format("org.apache.hadoop.hbase.spark")
>   .option("hbase.columns.mapping", "name STRING :key, age SHORT p:age, email STRING c:email, height FLOAT p:height")
>   .option("hbase.table", "person")
>   .option("hbase.spark.use.hbasecontext", false)
>   .save()
> {code}
> It cannot be read back with the same mapping:
> {code:scala}
> val df2 = sql.read
>   .format("org.apache.hadoop.hbase.spark")
>   .option("hbase.columns.mapping", "name STRING :key, age SHORT p:age, email STRING c:email, height FLOAT p:height")
>   .option("hbase.table", "person")
>   .option("hbase.spark.use.hbasecontext", false)
>   .load()
> df2.createOrReplaceTempView("tableView")
> val results = sql.sql("SELECT * FROM tableView")
> results.show()
> {code}
> The results:
> {noformat}
> +---+-----+---------+---------------+
> |age| name|   height|          email|
> +---+-----+---------+---------------+
> |  0|alice|   2.3125|alice@alice.com|
> |  0|  bob|    2.375|    bob@bob.com|
> |  0|carol|2.2568748|carol@carol.com|
> +---+-----+---------+---------------+
> {noformat}
> Spark stores integer values as longs and floating-point values as doubles, so shorts end up 8 bytes long in HBase, and floats also end up 8 bytes long:
> {noformat}
> shell> scan 'person'
>  alice                column=p:age, timestamp=1563450714829, value=\x00\x00\x00\x00\x00\x00\x00\x14
>  alice                column=p:height, timestamp=1563450714829, value=@\x14\x00\x00\x00\x00\x00\x00
> {noformat}
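The garbled values in the result table can be reproduced with plain java.nio, independent of the connector's actual read path (a minimal sketch: reading the first 2 bytes of the 8-byte big-endian long as a short, and the first 4 bytes of the 8-byte double as a float):

```java
import java.nio.ByteBuffer;

public class EightByteMismatch {
    public static void main(String[] args) {
        // Spark writes age = 20 as an 8-byte big-endian long:
        // \x00\x00\x00\x00\x00\x00\x00\x14 (matches the scan output above)
        byte[] ageBytes = ByteBuffer.allocate(8).putLong(20L).array();
        // A SHORT mapping consumes only the first 2 bytes, which are zero
        short age = ByteBuffer.wrap(ageBytes).getShort();
        System.out.println(age);     // prints 0

        // Spark writes height = 5.0 as an 8-byte double:
        // @\x14\x00\x00\x00\x00\x00\x00 (matches the scan output above)
        byte[] heightBytes = ByteBuffer.allocate(8).putDouble(5.0).array();
        // A FLOAT mapping reinterprets the first 4 bytes (0x40140000) as a float
        float height = ByteBuffer.wrap(heightBytes).getFloat();
        System.out.println(height);  // prints 2.3125
    }
}
```

This yields exactly the values in the result table: age 0 and height 2.3125 for alice, confirming that the read side truncates the 8-byte encodings rather than decoding them as long/double and narrowing.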



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)