Posted to user@phoenix.apache.org by Brandon Geise <br...@gmail.com> on 2018/08/04 13:43:08 UTC
Spark-Phoenix Plugin
Good morning,
I’m looking at using a combination of HBase, Phoenix and Spark for a project, and I read that using the Spark-Phoenix plugin directly is more efficient than JDBC. However, it wasn’t entirely clear from the examples whether an upsert is performed when writing a dataframe, or how many fine-grained options there are for executing the upsert. Any information someone can share would be greatly appreciated!
Thanks,
Brandon
Re: Spark-Phoenix Plugin
Posted by Josh Elser <el...@apache.org>.
Besides the distribution and parallelism of Spark as a distributed
execution framework, I can't really see how phoenix-spark would be
faster than the JDBC driver :). Phoenix-spark and the JDBC driver are
using the same code under the hood.
Phoenix-spark is using the PhoenixOutputFormat (and thus,
PhoenixRecordWriter) to write data to Phoenix. Maybe look at
PhoenixRecordWritable, too. These ultimately are executing UPSERTs on a
PreparedStatement.
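For illustration, the write path in both cases boils down to UPSERTs like the following (a minimal sketch against the Phoenix JDBC driver; the table, columns, and ZooKeeper quorum are hypothetical):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class PhoenixUpsertSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection string: jdbc:phoenix:<zookeeper quorum>
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:host1,host2,host3")) {
            conn.setAutoCommit(false);
            // UPSERT VALUES inserts the row, or overwrites it when the
            // primary key already exists -- there is no separate INSERT/UPDATE.
            String sql = "UPSERT INTO DATAX (O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS) VALUES (?, ?, ?)";
            try (PreparedStatement stmt = conn.prepareStatement(sql)) {
                stmt.setString(1, "1");
                stmt.setString(2, "100");
                stmt.setString(3, "O");
                stmt.executeUpdate();
            }
            conn.commit(); // mutations are buffered client-side until commit
        }
    }
}
```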
There is also the CsvBulkLoadTool, which can create HFiles to bulk load
data into Phoenix. I'm not sure if phoenix-spark has something wired up
that you can use to do this out of the box (certainly, you could do it
yourself).
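For reference, the bulk load tool is typically invoked from the command line along these lines (the jar name, paths, and table are placeholders; check the Phoenix docs for your version):

```shell
# Bulk load a CSV file into a Phoenix table by writing HFiles via MapReduce.
hadoop jar phoenix-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
  --table DATAX \
  --input /tmp/data/orders.csv \
  --zookeeper host1,host2,host3
```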
Re: Spark-Phoenix Plugin
Posted by James Taylor <ja...@apache.org>.
For the UPSERTs on a PreparedStatement that are done by Phoenix when writing
through the Spark adapter, note that these are *not* doing RPCs to the HBase
server to write data (i.e. they are never committed). Instead, the UPSERTs
are used to ensure that the correct serialization is performed given the
Phoenix schema. We use a PhoenixRuntime API to get the List<Cell> from the
uncommitted data and then perform a rollback. Using this technique,
features like salting, column encoding, row timestamps, etc. continue
to work with the Spark integration.
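A rough sketch of that flow (the connection string and table are illustrative, and the exact PhoenixRuntime signature varies across Phoenix versions, so treat this as an outline rather than a drop-in implementation):

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class SerializeWithoutCommit {
    public static void main(String[] args) throws Exception {
        // Hypothetical quorum; nothing below is ever committed to HBase.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:host1,host2,host3")) {
            conn.setAutoCommit(false);
            // The UPSERT only populates the client-side mutation state;
            // no write RPC is made because we never call commit().
            conn.createStatement().executeUpdate(
                "UPSERT INTO DATAX (O_ORDERKEY, O_CUSTKEY) VALUES ('1', '100')");
            // PhoenixRuntime.getUncommittedDataIterator(conn) yields the
            // serialized Cells for the buffered rows; those are handed to the
            // Spark/MapReduce writers, and the local buffer is then discarded:
            conn.rollback();
        }
    }
}
```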
Thanks,
James
Re: Spark-Phoenix Plugin
Posted by Jaanai Zhang <cl...@gmail.com>.
You can get better performance by reading/writing HBase directly. You can
also use spark-phoenix. Here is an example that reads data from a CSV file
and writes it into a Phoenix table:
def main(args: Array[String]): Unit = {

  val sc = new SparkContext("local", "phoenix-test")
  val path = "/tmp/data"
  val hbaseConnectionString = "host1,host2,host3"
  val customSchema = StructType(Array(
    StructField("O_ORDERKEY", StringType, true),
    StructField("O_CUSTKEY", StringType, true),
    StructField("O_ORDERSTATUS", StringType, true),
    StructField("O_TOTALPRICE", StringType, true),
    StructField("O_ORDERDATE", StringType, true),
    StructField("O_ORDERPRIORITY", StringType, true),
    StructField("O_CLERK", StringType, true),
    StructField("O_SHIPPRIORITY", StringType, true),
    StructField("O_COMMENT", StringType, true)))

  // import com.databricks.spark.csv._
  val sqlContext = new SQLContext(sc)

  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("delimiter", "|")
    .option("header", "false")
    .schema(customSchema)
    .load(path)

  val start = System.currentTimeMillis()
  df.write.format("org.apache.phoenix.spark")
    .mode("overwrite")
    .option("table", "DATAX")
    .option("zkUrl", hbaseConnectionString)
    .save()

  val end = System.currentTimeMillis()
  print("taken time:" + ((end - start) / 1000) + "s")
}
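The read side looks symmetric. A sketch, reusing the same hypothetical table and quorum as above:

```scala
// Read the Phoenix table back as a DataFrame via phoenix-spark.
val ordersDf = sqlContext.read
  .format("org.apache.phoenix.spark")
  .option("table", "DATAX")
  .option("zkUrl", hbaseConnectionString)
  .load()

// Filters such as this one are pushed down into the Phoenix scan.
ordersDf.filter(ordersDf("O_ORDERSTATUS") === "O").show()
```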
----------------------------------------
Yun Zhang
Best regards!
Re: Spark-Phoenix Plugin
Posted by Brandon Geise <br...@gmail.com>.
Thanks for the reply Yun.
I’m not quite clear how this would help on the upsert side. Are you suggesting deriving the types from Phoenix and then doing the encoding/decoding and writing/reading directly from HBase?
Thanks,
Brandon
Re: Spark-Phoenix Plugin
Posted by Jaanai Zhang <cl...@gmail.com>.
You can get the data types from the Phoenix metadata, then encode/decode the
data yourself to read/write directly. I think this way is effective, FYI :)
----------------------------------------
Yun Zhang
Best regards!