Posted to user@phoenix.apache.org by Nick Dimiduk <nd...@apache.org> on 2016/01/19 23:02:44 UTC

Re: phoenix-spark and pyspark

Hi guys,

I'm doing my best to follow along with [0], but I'm hitting some stumbling
blocks. I'm running with HDP 2.3 for HBase and Spark. My phoenix build is
much newer, basically 4.6-branch + PHOENIX-2503, PHOENIX-2568. I'm using
pyspark for now.

I've added phoenix-$VERSION-client-spark.jar to both
spark.executor.extraClassPath and spark.driver.extraClassPath. This allows
me to use sqlContext.read to define a DataFrame against a Phoenix table.
This appears to basically work, as I see PhoenixInputFormat in the logs and
df.printSchema() shows me what I expect. However, when I try df.take(5), I
get "IllegalStateException: unread block data" [1] from the workers. Poking
around, this is commonly a problem with classpath. Any ideas as to
specifically which jars are needed? Or, better still, how can I debug this
issue myself? Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath
gives me a VerifyError about a netty method version mismatch. Indeed, I see
two netty versions in that lib directory...
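
(For anyone following along, here is a quick way to sanity-check what the
driver JVM actually picked up from the pyspark shell. This is a debugging
sketch only; sc._jvm and sc._conf are internal handles, and the executor
setting may simply be unset:)

# classpath of the driver JVM backing this pyspark session
print(sc._jvm.java.lang.System.getProperty("java.class.path"))
# what the executors are told to prepend; their full classpath
# can't easily be inspected from the driver
print(sc._conf.get("spark.executor.extraClassPath", "<not set>"))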

Thanks a lot,
-n

[0]: http://phoenix.apache.org/phoenix_spark.html
[1]:

java.lang.IllegalStateException: unread block data
at
java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <ja...@apache.org>
wrote:

> Thanks for remembering about the docs, Josh.
>
> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <jm...@gmail.com> wrote:
>
>> Just an update for anyone interested, PHOENIX-2503 was just committed for
>> 4.7.0 and the docs have been updated to include these samples for PySpark
>> users.
>>
>> https://phoenix.apache.org/phoenix_spark.html
>>
>> Josh
>>
>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <jm...@gmail.com> wrote:
>>
>>> Hey Nick,
>>>
>>> I think this used to work, and will again once PHOENIX-2503 gets
>>> resolved. With the Spark DataFrame support, all the necessary glue is there
>>> for Phoenix and pyspark to play nice. With that client JAR (or by
>>> overriding the com.fasterxml.jackson JARS), you can do something like:
>>>
>>> df = sqlContext.read \
>>>   .format("org.apache.phoenix.spark") \
>>>   .option("table", "TABLE1") \
>>>   .option("zkUrl", "localhost:63512") \
>>>   .load()
>>>
>>> And
>>>
>>> df.write \
>>>   .format("org.apache.phoenix.spark") \
>>>   .mode("overwrite") \
>>>   .option("table", "TABLE1") \
>>>   .option("zkUrl", "localhost:63512") \
>>>   .save()
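>>>
>>> For completeness, a sketch of writing a DataFrame built on the fly (the
>>> column names need to line up with the Phoenix table's columns; ID and COL1
>>> are just placeholders here):
>>>
>>> local_df = sqlContext.createDataFrame([(1, "foo"), (2, "bar")], ["ID", "COL1"])
>>> local_df.write \
>>>   .format("org.apache.phoenix.spark") \
>>>   .mode("overwrite") \
>>>   .option("table", "TABLE1") \
>>>   .option("zkUrl", "localhost:63512") \
>>>   .save()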
>>>
>>>
>>> Yes, this should be added to the documentation. I hadn't actually tried
>>> this till just now. :)
>>>
>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <nd...@apache.org>
>>> wrote:
>>>
>>>> Heya,
>>>>
>>>> Has anyone any experience using phoenix-spark integration from pyspark
>>>> instead of scala? Folks prefer python around here...
>>>>
>>>> I did find this example [0] of using HBaseOutputFormat from pyspark, but
>>>> haven't tried extending it for phoenix. Maybe someone with more experience
>>>> in pyspark knows better? Would be a great addition to our documentation.
>>>>
>>>> Thanks,
>>>> Nick
>>>>
>>>> [0]:
>>>> https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
>>>>
>>>
>>>
>>
>

Re: phoenix-spark and pyspark

Posted by Josh Mahonin <jm...@gmail.com>.
Hey Nick,

Hopefully PHOENIX-2599 works out for you, but it's entirely possible that
some other issues will crop up. My usage patterns are probably a bit unique,
in that I generally do a few JDBC UPSERT SELECT aggregations from my "hot"
tables to intermediate ones before Spark is ever involved. That likely hides
an entire class of bugs, like the kind PHOENIX-2599 fixes.
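
To make that concrete, here is a sketch of the sort of pre-aggregation being
described (held in a Python string only because the rest of the thread is
pyspark). The table and column names are invented, and the statement would
run through a plain Phoenix JDBC client such as sqlline.py rather than
through Spark:

PRE_AGGREGATE = """
    UPSERT INTO METRICS_HOURLY (HOST, HOUR_START, TOTAL)
    SELECT HOST, TRUNC(CREATED_AT, 'HOUR'), SUM(VAL)
    FROM METRICS_HOT
    GROUP BY HOST, TRUNC(CREATED_AT, 'HOUR')
"""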

However, as you're likely aware, the Spark integration is a pretty thin
wrapper over the regular MR integration (the same one the Pig integration
uses), so hopefully there's not too much unexplored territory there in terms
of bulk loading and saving.

Good luck!

Josh

On Sat, Jan 23, 2016 at 9:12 PM, Nick Dimiduk <nd...@apache.org> wrote:

> That looks about right. I was unaware of this patch; thanks!
>
> On Fri, Jan 22, 2016 at 5:10 PM, James Taylor <ja...@apache.org>
> wrote:
>
>> FYI, Nick - do you know about Josh's fix for PHOENIX-2599? Does that help
>> here?
>>
>> On Fri, Jan 22, 2016 at 4:32 PM, Nick Dimiduk <nd...@gmail.com> wrote:
>>
>>> On Thu, Jan 21, 2016 at 7:36 AM, Josh Mahonin <jm...@gmail.com>
>>> wrote:
>>>
>>>> Amazing, good work.
>>>>
>>>
>>> All I did was consume your code and configure my cluster. Thanks though
>>> :)
>>>
>>>> FWIW, I've got a support case in with Hortonworks to get the
>>>> phoenix-spark integration working out of the box. Assuming it gets
>>>> resolved, that'll hopefully help keep these classpath-hell issues to a
>>>> minimum going forward.
>>>>
>>>
>>> That's fine for HWX customers, but it doesn't solve the general problem
>>> community-side. Unless, of course, we assume the only Phoenix users are on
>>> HDP.
>>>
>>>> Interesting point re: PHOENIX-2535. Spark does offer builds for specific
>>>> Hadoop versions, and also no Hadoop at all (with the assumption you'll
>>>> provide the necessary JARs). Phoenix is pretty tightly coupled with its own
>>>> HBase (and by extension, Hadoop) versions though... do you think it would be
>>>> possible to work around (2) if you locally added the HDP Maven repo and
>>>> adjusted versions accordingly? I've had some success with that in other
>>>> projects, though as I recall when I tried it with Phoenix I ran into a snag
>>>> trying to resolve some private transitive dependency of Hadoop.
>>>>
>>>
>>> I could specify the Hadoop version and build Phoenix locally to remove the
>>> issue. It would work for me because I happen to be packaging my own Phoenix
>>> this week, but it doesn't help for the Apache releases. Phoenix _is_
>>> tightly coupled to HBase versions, but I don't think HBase versions are
>>> that tightly coupled to Hadoop versions. We build HBase 1.x releases
>>> against a long list of Hadoop releases as part of our usual build. I think
>>> what HBase consumes re: Hadoop APIs is pretty limited.
>>>
>>>> Now that you have Spark working you can start hitting real bugs! If you
>>>> haven't backported the full patchset, you might want to take a look at the
>>>> phoenix-spark history [1]; there's been a lot of churn there, especially
>>>> with regards to the DataFrame API.
>>>>
>>>
>>> Yeah, speaking of which... :)
>>>
>>> I find this integration is basically unusable when the underlying HBase
>>> table partitions are in flux, i.e., when you have data being loaded
>>> concurrently with queries. Spark RDDs assume stable partitions, so multiple
>>> queries against a single DataFrame, or even a single query/DataFrame with
>>> lots of underlying splits and not enough workers, will inevitably fail
>>> with a ConcurrentModificationException like [0].
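>>>
>>> (A quick way to see how many underlying splits a given DataFrame picked up,
>>> roughly one Spark partition per Phoenix query-plan split -- just a sketch,
>>> with df being the Phoenix-backed DataFrame:)
>>>
>>> print(df.rdd.getNumPartitions())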
>>>
>>> I think in the HBase/MR integration, we define split points for the job as
>>> region boundaries initially, but then we're just using the regular API, so
>>> repartitions are handled transparently by the client layer. I need to dig
>>> into this Phoenix/Spark stuff to see if we can do anything similar here.
>>>
>>> Thanks again Josh,
>>> -n
>>>
>>> [0]:
>>>
>>> java.lang.RuntimeException: java.util.ConcurrentModificationException
>>> at
>>> org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(PhoenixInputFormat.java:125)
>>> at
>>> org.apache.phoenix.mapreduce.PhoenixInputFormat.createRecordReader(PhoenixInputFormat.java:69)
>>> at
>>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:131)
>>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
>>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
>>> at org.apache.phoenix.spark.PhoenixRDD.compute(PhoenixRDD.scala:57)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>>> at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:70)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>> Caused by: java.util.ConcurrentModificationException
>>> at java.util.Hashtable$Enumerator.next(Hashtable.java:1367)
>>> at org.apache.hadoop.conf.Configuration.iterator(Configuration.java:2154)
>>> at
>>> org.apache.phoenix.util.PropertiesUtil.extractProperties(PropertiesUtil.java:52)
>>> at
>>> org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(ConnectionUtil.java:60)
>>> at
>>> org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(ConnectionUtil.java:45)
>>> at
>>> org.apache.phoenix.mapreduce.util.PhoenixConfigurationUtil.getSelectColumnMetadataList(PhoenixConfigurationUtil.java:278)
>>> at
>>> org.apache.phoenix.mapreduce.util.PhoenixConfigurationUtil.getSelectStatement(PhoenixConfigurationUtil.java:306)
>>> at
>>> org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(PhoenixInputFormat.java:113)
>>> ... 32 more
>>>
>>>> On Thu, Jan 21, 2016 at 12:41 AM, Nick Dimiduk <nd...@apache.org>
>>>> wrote:
>>>>
>>>>> I finally got to the bottom of things. There were two issues at play
>>>>> in my particular environment.
>>>>>
>>>>> 1. An Ambari bug [0] meant my spark-defaults.conf file was garbage. I
>>>>> hardly thought of it when I hit the issue with MR job submission; its
>>>>> impact on Spark was much more subtle.
>>>>>
>>>>> 2. YARN client version mismatch (Phoenix is compiled against Apache 2.5.1
>>>>> while my cluster is running HDP's 2.7.1 build), per my earlier email. Once
>>>>> I'd worked around (1), I was able to work around (2) by setting
>>>>> "/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar"
>>>>> for both spark.driver.extraClassPath and spark.executor.extraClassPath. Per
>>>>> the linked thread above, I believe this places the correct YARN client
>>>>> version first in the classpath.
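>>>>>
>>>>> (The same values expressed programmatically, for anyone not going through
>>>>> spark-defaults.conf; a sketch only, and note the driver's extraClassPath
>>>>> generally has to be in place before the driver JVM launches, so for the
>>>>> driver this mostly just documents the value:)
>>>>>
>>>>> from pyspark import SparkConf, SparkContext
>>>>>
>>>>> cp = ("/usr/hdp/current/hadoop-yarn-client/*:"
>>>>>       "/usr/hdp/current/phoenix-client/phoenix-client-spark.jar")
>>>>> conf = (SparkConf()
>>>>>         .set("spark.driver.extraClassPath", cp)
>>>>>         .set("spark.executor.extraClassPath", cp))
>>>>> sc = SparkContext(conf=conf)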
>>>>>
>>>>> With the above in place, I'm able to submit work against the Phoenix tables
>>>>> to the YARN cluster. Success!
>>>>>
>>>>> Ironically enough, I would not have been able to work around (2) if we
>>>>> had PHOENIX-2535 in place. Food for thought in tackling that issue. It may
>>>>> be worthwhile to ship an uberclient jar that is entirely free of Hadoop (or
>>>>> HBase) classes. I believe Spark does this for Hadoop with its builds as
>>>>> well.
>>>>>
>>>>> Thanks again for your help here Josh! I really appreciate it.
>>>>>
>>>>> [0]: https://issues.apache.org/jira/browse/AMBARI-14751
>>>>>
>>>>> On Wed, Jan 20, 2016 at 2:23 PM, Nick Dimiduk <nd...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Well, I spoke too soon. It's working, but in local mode only. When I
>>>>>> invoke `pyspark --master yarn` (or yarn-client), the submitted application
>>>>>> goes from ACCEPTED to FAILED, with a NumberFormatException [0] in my
>>>>>> container log. Now that Phoenix is on my classpath, I'm suspicious that the
>>>>>> versions of YARN client libraries are incompatible. I found an old thread
>>>>>> [1] with the same stack trace I'm seeing and a similar conclusion. I tried
>>>>>> setting spark.driver.extraClassPath and spark.executor.extraClassPath
>>>>>> to /usr/hdp/current/hadoop-yarn-client:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>>>>> but that appears to have no impact.
>>>>>>
>>>>>> [0]:
>>>>>> 16/01/20 22:03:45 ERROR yarn.ApplicationMaster: Uncaught exception:
>>>>>> java.lang.IllegalArgumentException: Invalid ContainerId:
>>>>>> container_e07_1452901320122_0042_01_000001
>>>>>> at
>>>>>> org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:182)
>>>>>> at
>>>>>> org.apache.spark.deploy.yarn.YarnRMClient.getAttemptId(YarnRMClient.scala:93)
>>>>>> at
>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:85)
>>>>>> at
>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:574)
>>>>>> at
>>>>>> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
>>>>>> at
>>>>>> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
>>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>>> at
>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>>>>>> at
>>>>>> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
>>>>>> at
>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:572)
>>>>>> at
>>>>>> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:599)
>>>>>> at
>>>>>> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
>>>>>> Caused by: java.lang.NumberFormatException: For input string: "e07"
>>>>>> at
>>>>>> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>>>> at java.lang.Long.parseLong(Long.java:589)
>>>>>> at java.lang.Long.parseLong(Long.java:631)
>>>>>> at
>>>>>> org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137)
>>>>>> at
>>>>>> org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177)
>>>>>> ... 12 more
>>>>>>
>>>>>> [1]:
>>>>>> http://mail-archives.us.apache.org/mod_mbox/spark-user/201503.mbox/%3CCAAqMD1jSEvfyw9oUBymhZukm7f+WTDVZ8E6Zp3L4a9OBJ-hz=A@mail.gmail.com%3E
>>>>>>
>>>>>> On Wed, Jan 20, 2016 at 1:29 PM, Josh Mahonin <jm...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> That's great to hear. Looking forward to the doc patch!
>>>>>>>
>>>>>>> On Wed, Jan 20, 2016 at 3:43 PM, Nick Dimiduk <nd...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Josh -- I deployed my updated Phoenix build across the cluster,
>>>>>>>> added the phoenix-client-spark.jar to configs on the whole cluster, and
>>>>>>>> basic DataFrame access is now working. Let me see about updating the docs
>>>>>>>> page to be clearer; I'll send a patch by you for review.
>>>>>>>>
>>>>>>>> Thanks a lot for the help!
>>>>>>>> -n
>>>>>>>>
>>>>>>>> On Tue, Jan 19, 2016 at 5:59 PM, Josh Mahonin <jm...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Right, this cluster I just tested on is HDP 2.3.4, so it's Spark
>>>>>>>>> on YARN as well. I suppose the JAR is probably shipped by YARN, though I
>>>>>>>>> don't see any logging that says so, so I'm not certain how the nuts and bolts
>>>>>>>>> of that work. By explicitly setting the classpath, we're bypassing Spark's
>>>>>>>>> native JAR broadcast though.
>>>>>>>>>
>>>>>>>>> Taking a quick look at the config in Ambari (which ships the
>>>>>>>>> config to each node after saving), in 'Custom spark-defaults' I have the
>>>>>>>>> following:
>>>>>>>>>
>>>>>>>>> spark.driver.extraClassPath ->
>>>>>>>>> /etc/hbase/conf:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>>>>>>>> spark.executor.extraClassPath ->
>>>>>>>>> /usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>>>>>>>>
>>>>>>>>> I'm not sure if the /etc/hbase/conf entry is strictly needed, but I
>>>>>>>>> think that gets the Ambari-generated hbase-site.xml onto the classpath. Each
>>>>>>>>> node has the custom phoenix-client-spark.jar installed to that same path as
>>>>>>>>> well.
>>>>>>>>>
>>>>>>>>> I can pop into regular spark-shell and load RDDs/DataFrames using:
>>>>>>>>> /usr/hdp/current/spark-client/bin/spark-shell --master yarn-client
>>>>>>>>>
>>>>>>>>> or pyspark via:
>>>>>>>>> /usr/hdp/current/spark-client/bin/pyspark
>>>>>>>>>
>>>>>>>>> I also do this as the Ambari-created 'spark' user; I think there
>>>>>>>>> was some fun HDFS permission issue otherwise.
>>>>>>>>>
>>>>>>>>> On Tue, Jan 19, 2016 at 8:23 PM, Nick Dimiduk <ndimiduk@apache.org
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> I'm using Spark on YARN, not spark stand-alone. YARN NodeManagers
>>>>>>>>>> are colocated with RegionServers; all the hosts have everything. There are
>>>>>>>>>> no spark workers to restart. You're sure it's not shipped by the YARN
>>>>>>>>>> runtime?
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 19, 2016 at 5:07 PM, Josh Mahonin <jmahonin@gmail.com
>>>>>>>>>> > wrote:
>>>>>>>>>>
>>>>>>>>>>> Sadly, it needs to be installed onto each Spark worker (for
>>>>>>>>>>> now). The executor config tells each Spark worker to look for that file to
>>>>>>>>>>> add to its classpath, so once you have it installed, you'll probably need
>>>>>>>>>>> to restart all the Spark workers.
>>>>>>>>>>>
>>>>>>>>>>> I co-locate Spark and HBase/Phoenix nodes, so I just drop it in
>>>>>>>>>>> /usr/hdp/current/phoenix-client/, but anywhere that each worker can
>>>>>>>>>>> consistently see is fine.
>>>>>>>>>>>
>>>>>>>>>>> One day we'll be able to have Spark ship the JAR around and use
>>>>>>>>>>> it without this classpath nonsense, but we need to do some extra work on
>>>>>>>>>>> the Phoenix side to make sure that Phoenix's calls to DriverManager
>>>>>>>>>>> actually go through Spark's weird wrapper version of it.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <
>>>>>>>>>>> ndimiduk@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <
>>>>>>>>>>>> jmahonin@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> What version of Spark are you using?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Probably HDP's Spark 1.4.1; that's what the jars in my install
>>>>>>>>>>>> say, and the welcome message in the pyspark console agrees.
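>>>>>>>>>>>>
>>>>>>>>>>>> (sc.version from inside the pyspark shell reports the same thing,
>>>>>>>>>>>> for whatever that's worth.)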
>>>>>>>>>>>>
>>>>>>>>>>>> Are there any other traces of exceptions anywhere?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> No other exceptions that I can find. YARN apparently doesn't
>>>>>>>>>>>> want to aggregate spark's logs.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Are all your Spark nodes set up to point to the same
>>>>>>>>>>>>> phoenix-client-spark JAR?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, as far as I can tell... though come to think of it, is
>>>>>>>>>>>> that jar shipped by the spark runtime to workers, or is it loaded locally
>>>>>>>>>>>> on each host? I only changed spark-defaults.conf on the client machine, the
>>>>>>>>>>>> machine from which I submitted the job.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for taking a look Josh!
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <
>>>>>>>>>>>>> ndimiduk@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm doing my best to follow along with [0], but I'm hitting
>>>>>>>>>>>>>> some stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My
>>>>>>>>>>>>>> phoenix build is much newer, basically 4.6-branch + PHOENIX-2503,
>>>>>>>>>>>>>> PHOENIX-2568. I'm using pyspark for now.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've added phoenix-$VERSION-client-spark.jar to both
>>>>>>>>>>>>>> spark.executor.extraClassPath and spark.driver.extraClassPath. This allows
>>>>>>>>>>>>>> me to use sqlContext.read to define a DataFrame against a Phoenix table.
>>>>>>>>>>>>>> This appears to basically work, as I see PhoenixInputFormat in the logs and
>>>>>>>>>>>>>> df.printSchema() shows me what I expect. However, when I try df.take(5), I
>>>>>>>>>>>>>> get "IllegalStateException: unread block data" [1] from the workers. Poking
>>>>>>>>>>>>>> around, this is commonly a problem with classpath. Any ideas as to
>>>>>>>>>>>>>> specifically which jars are needed? Or better still, how to debug this
>>>>>>>>>>>>>> issue myself. Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath
>>>>>>>>>>>>>> gives me a VerifyError about netty method version mismatch. Indeed I see
>>>>>>>>>>>>>> two netty versions in that lib directory...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks a lot,
>>>>>>>>>>>>>> -n
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [0]: http://phoenix.apache.org/phoenix_spark.html
>>>>>>>>>>>>>> [1]:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> java.lang.IllegalStateException: unread block data
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>>>>>>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <
>>>>>>>>>>>>>> jamestaylor@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for remembering about the docs, Josh.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <
>>>>>>>>>>>>>>> jmahonin@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Just an update for anyone interested, PHOENIX-2503 was just
>>>>>>>>>>>>>>>> committed for 4.7.0 and the docs have been updated to include these samples
>>>>>>>>>>>>>>>> for PySpark users.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> https://phoenix.apache.org/phoenix_spark.html
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Josh
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <
>>>>>>>>>>>>>>>> jmahonin@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hey Nick,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think this used to work, and will again once
>>>>>>>>>>>>>>>>> PHOENIX-2503 gets resolved. With the Spark DataFrame support, all the
>>>>>>>>>>>>>>>>> necessary glue is there for Phoenix and pyspark to play nice. With that
>>>>>>>>>>>>>>>>> client JAR (or by overriding the com.fasterxml.jackson JARS), you can do
>>>>>>>>>>>>>>>>> something like:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> df = sqlContext.read \
>>>>>>>>>>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>>>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>>>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>>>>>>>>>>   .load()
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> And
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> df.write \
>>>>>>>>>>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>>>>>>>>>>   .mode("overwrite") \
>>>>>>>>>>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>>>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>>>>>>>>>>   .save()
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yes, this should be added to the documentation. I hadn't
>>>>>>>>>>>>>>>>> actually tried this till just now. :)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <
>>>>>>>>>>>>>>>>> ndimiduk@apache.org> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Heya,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Has anyone any experience using phoenix-spark integration
>>>>>>>>>>>>>>>>>> from pyspark instead of scala? Folks prefer python around here...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I did find this example [0] of using HBaseOutputFormat
>>>>>>>>>>>>>>>>>> from pyspark, haven't tried extending it for phoenix. Maybe someone with
>>>>>>>>>>>>>>>>>> more experience in pyspark knows better? Would be a great addition to our
>>>>>>>>>>>>>>>>>> documentation.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [0]:
>>>>>>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

>>>>>>>
>>>>>>> or pyspark via:
>>>>>>> /usr/hdp/current/spark-client/bin/pyspark
>>>>>>>
>>>>>>> I also do this as the Ambari-created 'spark' user, I think there was
>>>>>>> some fun HDFS permission issue otherwise.
>>>>>>>
>>>>>>> On Tue, Jan 19, 2016 at 8:23 PM, Nick Dimiduk <nd...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I'm using Spark on YARN, not spark stand-alone. YARN NodeManagers
>>>>>>>> are colocated with RegionServers; all the hosts have everything. There are
>>>>>>>> no spark workers to restart. You're sure it's not shipped by the YARN
>>>>>>>> runtime?
>>>>>>>>
>>>>>>>> On Tue, Jan 19, 2016 at 5:07 PM, Josh Mahonin <jm...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Sadly, it needs to be installed onto each Spark worker (for now).
>>>>>>>>> The executor config tells each Spark worker to look for that file to add to
>>>>>>>>> its classpath, so once you have it installed, you'll probably need to
>>>>>>>>> restart all the Spark workers.
>>>>>>>>>
>>>>>>>>> I co-locate Spark and HBase/Phoenix nodes, so I just drop it in
>>>>>>>>> /usr/hdp/current/phoenix-client/, but anywhere that each worker can
>>>>>>>>> consistently see is fine.
>>>>>>>>>
>>>>>>>>> One day we'll be able to have Spark ship the JAR around and use it
>>>>>>>>> without this classpath nonsense, but we need to do some extra work on the
>>>>>>>>> Phoenix side to make sure that Phoenix's calls to DriverManager actually go
>>>>>>>>> through Spark's weird wrapper version of it.
>>>>>>>>>
>>>>>>>>> On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <ndimiduk@apache.org
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <jmahonin@gmail.com
>>>>>>>>>> > wrote:
>>>>>>>>>>
>>>>>>>>>>> What version of Spark are you using?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Probably HDP's Spark 1.4.1; that's what the jars in my install
>>>>>>>>>> say, and the welcome message in the pyspark console agrees.
>>>>>>>>>>
>>>>>>>>>> Are there any other traces of exceptions anywhere?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> No other exceptions that I can find. YARN apparently doesn't want
>>>>>>>>>> to aggregate spark's logs.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Are all your Spark nodes set up to point to the same
>>>>>>>>>>> phoenix-client-spark JAR?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes, as far as I can tell... though come to think of it, is that
>>>>>>>>>> jar shipped by the spark runtime to workers, or is it loaded locally on
>>>>>>>>>> each host? I only changed spark-defaults.conf on the client machine, the
>>>>>>>>>> machine from which I submitted the job.
>>>>>>>>>>
>>>>>>>>>> Thanks for taking a look Josh!
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <
>>>>>>>>>>> ndimiduk@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>>
>>>>>>>>>>>> I'm doing my best to follow along with [0], but I'm hitting
>>>>>>>>>>>> some stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My
>>>>>>>>>>>> phoenix build is much newer, basically 4.6-branch + PHOENIX-2503,
>>>>>>>>>>>> PHOENIX-2568. I'm using pyspark for now.
>>>>>>>>>>>>
>>>>>>>>>>>> I've added phoenix-$VERSION-client-spark.jar to both
>>>>>>>>>>>> spark.executor.extraClassPath and spark.driver.extraClassPath. This allows
>>>>>>>>>>>> me to use sqlContext.read to define a DataFrame against a Phoenix table.
>>>>>>>>>>>> This appears to basically work, as I see PhoenixInputFormat in the logs and
>>>>>>>>>>>> df.printSchema() shows me what I expect. However, when I try df.take(5), I
>>>>>>>>>>>> get "IllegalStateException: unread block data" [1] from the workers. Poking
>>>>>>>>>>>> around, this is commonly a problem with classpath. Any ideas as to
>>>>>>>>>>>> specifically which jars are needed? Or better still, how to debug this
>>>>>>>>>>>> issue myself. Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath
>>>>>>>>>>>> gives me a VerifyError about netty method version mismatch. Indeed I see
>>>>>>>>>>>> two netty versions in that lib directory...
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks a lot,
>>>>>>>>>>>> -n
>>>>>>>>>>>>
>>>>>>>>>>>> [0]: http://phoenix.apache.org/phoenix_spark.html
>>>>>>>>>>>> [1]:
>>>>>>>>>>>>
>>>>>>>>>>>> java.lang.IllegalStateException: unread block data
>>>>>>>>>>>> at
>>>>>>>>>>>> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>>>>>>>>>>> at
>>>>>>>>>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>>>>>>>>>>> at
>>>>>>>>>>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>>>>>>>>>>> at
>>>>>>>>>>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>>>>>>>>>>> at
>>>>>>>>>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>>>>>>>> at
>>>>>>>>>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>>>>>>>> at
>>>>>>>>>>>> java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>>>>>>>>> at
>>>>>>>>>>>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>>>>>>>>>>> at
>>>>>>>>>>>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>>>>>>>>>>> at
>>>>>>>>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>>>>>>>>>>> at
>>>>>>>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>>>>>>>> at
>>>>>>>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>>>>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <
>>>>>>>>>>>> jamestaylor@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for remembering about the docs, Josh.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <
>>>>>>>>>>>>> jmahonin@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just an update for anyone interested, PHOENIX-2503 was just
>>>>>>>>>>>>>> committed for 4.7.0 and the docs have been updated to include these samples
>>>>>>>>>>>>>> for PySpark users.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://phoenix.apache.org/phoenix_spark.html
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Josh
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <
>>>>>>>>>>>>>> jmahonin@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hey Nick,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think this used to work, and will again once PHOENIX-2503
>>>>>>>>>>>>>>> gets resolved. With the Spark DataFrame support, all the necessary glue is
>>>>>>>>>>>>>>> there for Phoenix and pyspark to play nice. With that client JAR (or by
>>>>>>>>>>>>>>> overriding the com.fasterxml.jackson JARS), you can do something like:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> df = sqlContext.read \
>>>>>>>>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>>>>>>>>   .load()
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> And
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> df.write \
>>>>>>>>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>>>>>>>>   .mode("overwrite") \
>>>>>>>>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>>>>>>>>   .save()
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, this should be added to the documentation. I hadn't
>>>>>>>>>>>>>>> actually tried this till just now. :)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <
>>>>>>>>>>>>>>> ndimiduk@apache.org> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Heya,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Has anyone any experience using phoenix-spark integration
>>>>>>>>>>>>>>>> from pyspark instead of scala? Folks prefer python around here...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I did find this example [0] of using HBaseOutputFormat from
>>>>>>>>>>>>>>>> pyspark, haven't tried extending it for phoenix. Maybe someone with more
>>>>>>>>>>>>>>>> experience in pyspark knows better? Would be a great addition to our
>>>>>>>>>>>>>>>> documentation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [0]:
>>>>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: phoenix-spark and pyspark

Posted by Nick Dimiduk <nd...@gmail.com>.
On Thu, Jan 21, 2016 at 7:36 AM, Josh Mahonin <jm...@gmail.com> wrote:

> Amazing, good work.
>

All I did was consume your code and configure my cluster. Thanks though :)

> FWIW, I've got a support case in with Hortonworks to get the phoenix-spark
> integration working out of the box. Assuming it gets resolved, that'll
> hopefully help keep these classpath-hell issues to a minimum going forward.
>

That's fine for HWX customers, but it doesn't solve the general problem for
the community. Unless, of course, we assume the only Phoenix users are on
HDP.

> Interesting point re: PHOENIX-2535. Spark does offer builds for specific
> Hadoop versions, and also no Hadoop at all (with the assumption you'll
> provide the necessary JARs). Phoenix is pretty tightly coupled with its own
> HBase (and by extension, Hadoop) versions though... do you think it would be
> possible to work around (2) if you locally added the HDP Maven repo and
> adjusted versions accordingly? I've had some success with that in other
> projects, though as I recall when I tried it with Phoenix I ran into a snag
> trying to resolve some private transitive dependency of Hadoop.
>

I could specify the Hadoop version and build Phoenix locally to remove the
issue. It would work for me because I happen to be packaging my own Phoenix
this week, but it doesn't help for the Apache releases. Phoenix _is_
tightly coupled to HBase versions, but I don't think HBase versions are
that tightly coupled to Hadoop versions. We build HBase 1.x releases
against a long list of Hadoop releases as part of our usual build. I think
what HBase consumes re: Hadoop APIs is pretty limited.
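
For reference, the rebuild I have in mind is just a version override at
package time, roughly the command below. Treat it as a sketch only: it
assumes the Phoenix POM of this vintage exposes the Hadoop version as a
"hadoop.version" property, and both the property name and the version string
may differ, so check the POM before leaning on it.

# hypothetical local rebuild against the cluster's Hadoop line
mvn clean package -DskipTests -Dhadoop.version=2.7.1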

> Now that you have Spark working you can start hitting real bugs! If you
> haven't backported the full patchset, you might want to take a look at the
> phoenix-spark history [1], there's been a lot of churn there, especially
> with regards to the DataFrame API.
>

Yeah, speaking of which... :)

I find this integration is basically unusable when the underlying HBase
table partitions are in flux, i.e., when data is being loaded concurrently
with queries. Spark RDDs assume stable partitions; multiple queries against a
single DataFrame, or even a single query/DataFrame with lots of underlying
splits and not enough workers, will inevitably fail with a
ConcurrentModificationException like [0].
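
To make the failure mode concrete, here is roughly the access pattern that
trips it for me, as a pyspark sketch. The table name, ZooKeeper quorum, and
loop count below are made up, and it assumes a pyspark shell where sqlContext
is already defined.

# sketch only: repeatedly act on one DataFrame while another process is
# loading rows into (and splitting) the same table
df = sqlContext.read \
  .format("org.apache.phoenix.spark") \
  .option("table", "EVENTS") \
  .option("zkUrl", "zk-host:2181") \
  .load()

for i in range(10):
    # each action scans the table again; with regions splitting underneath
    # us, tasks eventually die with the ConcurrentModificationException in [0]
    print(df.count())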

I think in the HBase/MR integration, we define split points for the job as
region boundaries initially, but then we're just using the regular API, so
repartitions are handled transparently by the client layer. I need to dig
into this Phoenix/Spark stuff to see if we can do anything similar here.

Thanks again Josh,
-n

[0]:

java.lang.RuntimeException: java.util.ConcurrentModificationException
at
org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(PhoenixInputFormat.java:125)
at
org.apache.phoenix.mapreduce.PhoenixInputFormat.createRecordReader(PhoenixInputFormat.java:69)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:131)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
at org.apache.phoenix.spark.PhoenixRDD.compute(PhoenixRDD.scala:57)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.ConcurrentModificationException
at java.util.Hashtable$Enumerator.next(Hashtable.java:1367)
at org.apache.hadoop.conf.Configuration.iterator(Configuration.java:2154)
at
org.apache.phoenix.util.PropertiesUtil.extractProperties(PropertiesUtil.java:52)
at
org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(ConnectionUtil.java:60)
at
org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(ConnectionUtil.java:45)
at
org.apache.phoenix.mapreduce.util.PhoenixConfigurationUtil.getSelectColumnMetadataList(PhoenixConfigurationUtil.java:278)
at
org.apache.phoenix.mapreduce.util.PhoenixConfigurationUtil.getSelectStatement(PhoenixConfigurationUtil.java:306)
at
org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(PhoenixInputFormat.java:113)
... 32 more

On Thu, Jan 21, 2016 at 12:41 AM, Nick Dimiduk <nd...@apache.org> wrote:
>
>> I finally got to the bottom of things. There were two issues at play in
>> my particular environment.
>>
>> 1. An Ambari bug [0] means my spark-defaults.conf file was garbage. I
>> hardly thought of it when I hit the issue with MR job submission; its
>> impact on Spark was much more subtle.
>>
>> 2. YARN client version mismatch (Phoenix is compiled vs Apache 2.5.1
>> while my cluster is running HDP's 2.7.1 build), per my earlier email. Once
>> I'd worked around (1), I was able to work around (2) by setting
>> "/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar"
>> for both spark.driver.extraClassPath and spark.executor.extraClassPath. Per
>> the linked thread above, I believe this places the correct YARN client
>> version first in the classpath.
>>
>> With the above in place, I'm able to submit work vs the Phoenix tables to
>> the YARN cluster. Success!
>>
>> Ironically enough, I would not have been able to work around (2) if we
>> had PHOENIX-2535 in place. Food for thought in tackling that issue. It may
>> be worthwhile to ship an uberclient jar that is entirely without Hadoop (or
>> HBase) classes. I believe Spark does this for Hadoop with their builds as
>> well.
>>
>> Thanks again for your help here Josh! I really appreciate it.
>>
>> [0]: https://issues.apache.org/jira/browse/AMBARI-14751
>>
>> On Wed, Jan 20, 2016 at 2:23 PM, Nick Dimiduk <nd...@apache.org>
>> wrote:
>>
>>> Well, I spoke too soon. It's working, but in local mode only. When I
>>> invoke `pyspark --master yarn` (or yarn-client), the submitted application
>>> goes from ACCEPTED to FAILED, with a NumberFormatException [0] in my
>>> container log. Now that Phoenix is on my classpath, I'm suspicious that the
>>> versions of YARN client libraries are incompatible. I found an old thread
>>> [1] with the same stack trace I'm seeing, similar conclusion. I tried
>>> setting spark.driver.extraClassPath and spark.executor.extraClassPath
>>> to /usr/hdp/current/hadoop-yarn-client:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>> but that appears to have no impact.
>>>
>>> [0]:
>>> 16/01/20 22:03:45 ERROR yarn.ApplicationMaster: Uncaught exception:
>>> java.lang.IllegalArgumentException: Invalid ContainerId:
>>> container_e07_1452901320122_0042_01_000001
>>> at
>>> org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:182)
>>> at
>>> org.apache.spark.deploy.yarn.YarnRMClient.getAttemptId(YarnRMClient.scala:93)
>>> at
>>> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:85)
>>> at
>>> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:574)
>>> at
>>> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
>>> at
>>> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>> at
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>>> at
>>> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
>>> at
>>> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:572)
>>> at
>>> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:599)
>>> at
>>> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
>>> Caused by: java.lang.NumberFormatException: For input string: "e07"
>>> at
>>> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>> at java.lang.Long.parseLong(Long.java:589)
>>> at java.lang.Long.parseLong(Long.java:631)
>>> at
>>> org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137)
>>> at
>>> org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177)
>>> ... 12 more
>>>
>>> [1]:
>>> http://mail-archives.us.apache.org/mod_mbox/spark-user/201503.mbox/%3CCAAqMD1jSEvfyw9oUBymhZukm7f+WTDVZ8E6Zp3L4a9OBJ-hz=A@mail.gmail.com%3E
>>>
>>> On Wed, Jan 20, 2016 at 1:29 PM, Josh Mahonin <jm...@gmail.com>
>>> wrote:
>>>
>>>> That's great to hear. Looking forward to the doc patch!
>>>>
>>>> On Wed, Jan 20, 2016 at 3:43 PM, Nick Dimiduk <nd...@apache.org>
>>>> wrote:
>>>>
>>>>> Josh -- I deployed my updated phoenix build across the cluster, added
>>>>> the phoenix-client-spark.jar to configs on the whole cluster, and now basic
>>>>> dataframe access is now working. Let me see about updating the docs page to
>>>>> be more clear, I'll send a patch by you for review.
>>>>>
>>>>> Thanks a lot for the help!
>>>>> -n
>>>>>
>>>>> On Tue, Jan 19, 2016 at 5:59 PM, Josh Mahonin <jm...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Right, this cluster I just tested on is HDP 2.3.4, so it's Spark on
>>>>>> YARN as well. I suppose the JAR is probably shipped by YARN, though I don't
>>>>>> see any logging saying it, so I'm not certain how the nuts and bolts of
>>>>>> that work. By explicitly setting the classpath, we're bypassing Spark's
>>>>>> native JAR broadcast though.
>>>>>>
>>>>>> Taking a quick look at the config in Ambari (which ships the config
>>>>>> to each node after saving), in 'Custom spark-defaults' I have the following:
>>>>>>
>>>>>> spark.driver.extraClassPath ->
>>>>>> /etc/hbase/conf:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>>>>> spark.executor.extraClassPath ->
>>>>>> /usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>>>>>
>>>>>> I'm not sure if the /etc/hbase/conf is necessarily needed, but I
>>>>>> think that gets the Ambari generated hbase-site.xml in the classpath. Each
>>>>>> node has the custom phoenix-client-spark.jar installed to that same path as
>>>>>> well.
>>>>>>
>>>>>> I can pop into regular spark-shell and load RDDs/DataFrames using:
>>>>>> /usr/hdp/current/spark-client/bin/spark-shell --master yarn-client
>>>>>>
>>>>>> or pyspark via:
>>>>>> /usr/hdp/current/spark-client/bin/pyspark
>>>>>>
>>>>>> I also do this as the Ambari-created 'spark' user, I think there was
>>>>>> some fun HDFS permission issue otherwise.
>>>>>>
>>>>>> On Tue, Jan 19, 2016 at 8:23 PM, Nick Dimiduk <nd...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> I'm using Spark on YARN, not spark stand-alone. YARN NodeManagers
>>>>>>> are colocated with RegionServers; all the hosts have everything. There are
>>>>>>> no spark workers to restart. You're sure it's not shipped by the YARN
>>>>>>> runtime?
>>>>>>>
>>>>>>> On Tue, Jan 19, 2016 at 5:07 PM, Josh Mahonin <jm...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Sadly, it needs to be installed onto each Spark worker (for now).
>>>>>>>> The executor config tells each Spark worker to look for that file to add to
>>>>>>>> its classpath, so once you have it installed, you'll probably need to
>>>>>>>> restart all the Spark workers.
>>>>>>>>
>>>>>>>> I co-locate Spark and HBase/Phoenix nodes, so I just drop it in
>>>>>>>> /usr/hdp/current/phoenix-client/, but anywhere that each worker can
>>>>>>>> consistently see is fine.
>>>>>>>>
>>>>>>>> One day we'll be able to have Spark ship the JAR around and use it
>>>>>>>> without this classpath nonsense, but we need to do some extra work on the
>>>>>>>> Phoenix side to make sure that Phoenix's calls to DriverManager actually go
>>>>>>>> through Spark's weird wrapper version of it.
>>>>>>>>
>>>>>>>> On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <nd...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <jm...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> What version of Spark are you using?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Probably HDP's Spark 1.4.1; that's what the jars in my install
>>>>>>>>> say, and the welcome message in the pyspark console agrees.
>>>>>>>>>
>>>>>>>>> Are there any other traces of exceptions anywhere?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> No other exceptions that I can find. YARN apparently doesn't want
>>>>>>>>> to aggregate spark's logs.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Are all your Spark nodes set up to point to the same
>>>>>>>>>> phoenix-client-spark JAR?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes, as far as I can tell... though come to think of it, is that
>>>>>>>>> jar shipped by the spark runtime to workers, or is it loaded locally on
>>>>>>>>> each host? I only changed spark-defaults.conf on the client machine, the
>>>>>>>>> machine from which I submitted the job.
>>>>>>>>>
>>>>>>>>> Thanks for taking a look Josh!
>>>>>>>>>
>>>>>>>>> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <ndimiduk@apache.org
>>>>>>>>>> > wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi guys,
>>>>>>>>>>>
>>>>>>>>>>> I'm doing my best to follow along with [0], but I'm hitting some
>>>>>>>>>>> stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My phoenix
>>>>>>>>>>> build is much newer, basically 4.6-branch + PHOENIX-2503, PHOENIX-2568. I'm
>>>>>>>>>>> using pyspark for now.
>>>>>>>>>>>
>>>>>>>>>>> I've added phoenix-$VERSION-client-spark.jar to both
>>>>>>>>>>> spark.executor.extraClassPath and spark.driver.extraClassPath. This allows
>>>>>>>>>>> me to use sqlContext.read to define a DataFrame against a Phoenix table.
>>>>>>>>>>> This appears to basically work, as I see PhoenixInputFormat in the logs and
>>>>>>>>>>> df.printSchema() shows me what I expect. However, when I try df.take(5), I
>>>>>>>>>>> get "IllegalStateException: unread block data" [1] from the workers. Poking
>>>>>>>>>>> around, this is commonly a problem with classpath. Any ideas as to
>>>>>>>>>>> specifically which jars are needed? Or better still, how to debug this
>>>>>>>>>>> issue myself. Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath
>>>>>>>>>>> gives me a VerifyError about netty method version mismatch. Indeed I see
>>>>>>>>>>> two netty versions in that lib directory...
>>>>>>>>>>>
>>>>>>>>>>> Thanks a lot,
>>>>>>>>>>> -n
>>>>>>>>>>>
>>>>>>>>>>> [0]: http://phoenix.apache.org/phoenix_spark.html
>>>>>>>>>>> [1]:
>>>>>>>>>>>
>>>>>>>>>>> java.lang.IllegalStateException: unread block data
>>>>>>>>>>> at
>>>>>>>>>>> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>>>>>>>>>> at
>>>>>>>>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>>>>>>>>>> at
>>>>>>>>>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>>>>>>>>>> at
>>>>>>>>>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>>>>>>>>>> at
>>>>>>>>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>>>>>>> at
>>>>>>>>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>>>>>>> at
>>>>>>>>>>> java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>>>>>>>> at
>>>>>>>>>>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>>>>>>>>>> at
>>>>>>>>>>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>>>>>>>>>> at
>>>>>>>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>>>>>>>>>> at
>>>>>>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>>>>>>> at
>>>>>>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>>>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <
>>>>>>>>>>> jamestaylor@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks for remembering about the docs, Josh.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <
>>>>>>>>>>>> jmahonin@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Just an update for anyone interested, PHOENIX-2503 was just
>>>>>>>>>>>>> committed for 4.7.0 and the docs have been updated to include these samples
>>>>>>>>>>>>> for PySpark users.
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://phoenix.apache.org/phoenix_spark.html
>>>>>>>>>>>>>
>>>>>>>>>>>>> Josh
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <
>>>>>>>>>>>>> jmahonin@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey Nick,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think this used to work, and will again once PHOENIX-2503
>>>>>>>>>>>>>> gets resolved. With the Spark DataFrame support, all the necessary glue is
>>>>>>>>>>>>>> there for Phoenix and pyspark to play nice. With that client JAR (or by
>>>>>>>>>>>>>> overriding the com.fasterxml.jackson JARS), you can do something like:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> df = sqlContext.read \
>>>>>>>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>>>>>>>   .load()
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> df.write \
>>>>>>>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>>>>>>>   .mode("overwrite") \
>>>>>>>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>>>>>>>   .save()
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, this should be added to the documentation. I hadn't
>>>>>>>>>>>>>> actually tried this till just now. :)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <
>>>>>>>>>>>>>> ndimiduk@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Heya,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Has anyone any experience using phoenix-spark integration
>>>>>>>>>>>>>>> from pyspark instead of scala? Folks prefer python around here...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I did find this example [0] of using HBaseOutputFormat from
>>>>>>>>>>>>>>> pyspark, haven't tried extending it for phoenix. Maybe someone with more
>>>>>>>>>>>>>>> experience in pyspark knows better? Would be a great addition to our
>>>>>>>>>>>>>>> documentation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [0]:
>>>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: phoenix-spark and pyspark

Posted by Josh Mahonin <jm...@gmail.com>.
Amazing, good work.

FWIW, I've got a support case in with Hortonworks to get the phoenix-spark
integration working out of the box. Assuming it gets resolved, that'll
hopefully help keep these classpath-hell issues to a minimum going forward.

Interesting point re: PHOENIX-2535. Spark does offer builds for specific
Hadoop versions, and also no Hadoop at all (with the assumption you'll
provide the necessary JARs). Phoenix is pretty tightly coupled with its own
HBase (and by extension, Hadoop) versions though... do you think it would be
possible to work around (2) if you locally added the HDP Maven repo and
adjusted versions accordingly? I've had some success with that in other
projects, though as I recall when I tried it with Phoenix I ran into a snag
trying to resolve some private transitive dependency of Hadoop.
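
If you do go that route, the repository addition itself is just a small block
in the top-level pom.xml, something like the sketch below. The id and URL are
placeholders rather than the real HDP coordinates, so substitute whatever
repository your distribution documents, then bump the Hadoop and HBase version
properties to match the cluster.

<repositories>
  <repository>
    <id>vendor-releases</id>
    <!-- placeholder URL: point this at the vendor's Maven repository -->
    <url>http://example.com/vendor/releases/</url>
  </repository>
</repositories>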

Now that you have Spark working you can start hitting real bugs! If you
haven't backported the full patchset, you might want to take a look at the
phoenix-spark history [1], there's been a lot of churn there, especially
with regards to the DataFrame API.

Josh

[1] https://github.com/apache/phoenix/commits/master/phoenix-spark

On Thu, Jan 21, 2016 at 12:41 AM, Nick Dimiduk <nd...@apache.org> wrote:

> I finally got to the bottom of things. There were two issues at play in my
> particular environment.
>
> 1. An Ambari bug [0] means my spark-defaults.conf file was garbage. I
> hardly thought of it when I hit the issue with MR job submission; its
> impact on Spark was much more subtle.
>
> 2. YARN client version mismatch (Phoenix is compiled vs Apache 2.5.1 while
> my cluster is running HDP's 2.7.1 build), per my earlier email. Once I'd
> worked around (1), I was able to work around (2) by setting
> "/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar"
> for both spark.driver.extraClassPath and spark.executor.extraClassPath. Per
> the linked thread above, I believe this places the correct YARN client
> version first in the classpath.
>
> With the above in place, I'm able to submit work vs the Phoenix tables to
> the YARN cluster. Success!
>
> Ironically enough, I would not have been able to work around (2) if we
> had PHOENIX-2535 in place. Food for thought in tackling that issue. It may
> be worthwhile to ship an uberclient jar that is entirely without Hadoop (or
> HBase) classes. I believe Spark does this for Hadoop with their builds as
> well.
>
> Thanks again for your help here Josh! I really appreciate it.
>
> [0]: https://issues.apache.org/jira/browse/AMBARI-14751
>
> On Wed, Jan 20, 2016 at 2:23 PM, Nick Dimiduk <nd...@apache.org> wrote:
>
>> Well, I spoke too soon. It's working, but in local mode only. When I
>> invoke `pyspark --master yarn` (or yarn-client), the submitted application
>> goes from ACCEPTED to FAILED, with a NumberFormatException [0] in my
>> container log. Now that Phoenix is on my classpath, I'm suspicious that the
>> versions of YARN client libraries are incompatible. I found an old thread
>> [1] with the same stack trace I'm seeing, similar conclusion. I tried
>> setting spark.driver.extraClassPath and spark.executor.extraClassPath
>> to /usr/hdp/current/hadoop-yarn-client:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>> but that appears to have no impact.
>>
>> [0]:
>> 16/01/20 22:03:45 ERROR yarn.ApplicationMaster: Uncaught exception:
>> java.lang.IllegalArgumentException: Invalid ContainerId:
>> container_e07_1452901320122_0042_01_000001
>> at
>> org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:182)
>> at
>> org.apache.spark.deploy.yarn.YarnRMClient.getAttemptId(YarnRMClient.scala:93)
>> at
>> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:85)
>> at
>> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:574)
>> at
>> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
>> at
>> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:422)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>> at
>> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
>> at
>> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:572)
>> at
>> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:599)
>> at
>> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
>> Caused by: java.lang.NumberFormatException: For input string: "e07"
>> at
>> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>> at java.lang.Long.parseLong(Long.java:589)
>> at java.lang.Long.parseLong(Long.java:631)
>> at
>> org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137)
>> at
>> org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177)
>> ... 12 more
>>
>> [1]:
>> http://mail-archives.us.apache.org/mod_mbox/spark-user/201503.mbox/%3CCAAqMD1jSEvfyw9oUBymhZukm7f+WTDVZ8E6Zp3L4a9OBJ-hz=A@mail.gmail.com%3E
>>
>> On Wed, Jan 20, 2016 at 1:29 PM, Josh Mahonin <jm...@gmail.com> wrote:
>>
>>> That's great to hear. Looking forward to the doc patch!
>>>
>>> On Wed, Jan 20, 2016 at 3:43 PM, Nick Dimiduk <nd...@apache.org>
>>> wrote:
>>>
>>>> Josh -- I deployed my updated phoenix build across the cluster, added
>>>> the phoenix-client-spark.jar to configs on the whole cluster, and now basic
>>>> dataframe access is now working. Let me see about updating the docs page to
>>>> be more clear, I'll send a patch by you for review.
>>>>
>>>> Thanks a lot for the help!
>>>> -n
>>>>
>>>> On Tue, Jan 19, 2016 at 5:59 PM, Josh Mahonin <jm...@gmail.com>
>>>> wrote:
>>>>
>>>>> Right, this cluster I just tested on is HDP 2.3.4, so it's Spark on
>>>>> YARN as well. I suppose the JAR is probably shipped by YARN, though I don't
>>>>> see any logging saying it, so I'm not certain how the nuts and bolts of
>>>>> that work. By explicitly setting the classpath, we're bypassing Spark's
>>>>> native JAR broadcast though.
>>>>>
>>>>> Taking a quick look at the config in Ambari (which ships the config to
>>>>> each node after saving), in 'Custom spark-defaults' I have the following:
>>>>>
>>>>> spark.driver.extraClassPath ->
>>>>> /etc/hbase/conf:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>>>> spark.executor.extraClassPath ->
>>>>> /usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>>>>
>>>>> I'm not sure if the /etc/hbase/conf is necessarily needed, but I think
>>>>> that gets the Ambari generated hbase-site.xml in the classpath. Each node
>>>>> has the custom phoenix-client-spark.jar installed to that same path as well.
>>>>>
>>>>> I can pop into regular spark-shell and load RDDs/DataFrames using:
>>>>> /usr/hdp/current/spark-client/bin/spark-shell --master yarn-client
>>>>>
>>>>> or pyspark via:
>>>>> /usr/hdp/current/spark-client/bin/pyspark
>>>>>
>>>>> I also do this as the Ambari-created 'spark' user, I think there was
>>>>> some fun HDFS permission issue otherwise.
>>>>>
>>>>> On Tue, Jan 19, 2016 at 8:23 PM, Nick Dimiduk <nd...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> I'm using Spark on YARN, not spark stand-alone. YARN NodeManagers are
>>>>>> colocated with RegionServers; all the hosts have everything. There are no
>>>>>> spark workers to restart. You're sure it's not shipped by the YARN runtime?
>>>>>>
>>>>>> On Tue, Jan 19, 2016 at 5:07 PM, Josh Mahonin <jm...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Sadly, it needs to be installed onto each Spark worker (for now).
>>>>>>> The executor config tells each Spark worker to look for that file to add to
>>>>>>> its classpath, so once you have it installed, you'll probably need to
>>>>>>> restart all the Spark workers.
>>>>>>>
>>>>>>> I co-locate Spark and HBase/Phoenix nodes, so I just drop it in
>>>>>>> /usr/hdp/current/phoenix-client/, but anywhere that each worker can
>>>>>>> consistently see is fine.
>>>>>>>
>>>>>>> One day we'll be able to have Spark ship the JAR around and use it
>>>>>>> without this classpath nonsense, but we need to do some extra work on the
>>>>>>> Phoenix side to make sure that Phoenix's calls to DriverManager actually go
>>>>>>> through Spark's weird wrapper version of it.
>>>>>>>
>>>>>>> On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <nd...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <jm...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> What version of Spark are you using?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Probably HDP's Spark 1.4.1; that's what the jars in my install say,
>>>>>>>> and the welcome message in the pyspark console agrees.
>>>>>>>>
>>>>>>>> Are there any other traces of exceptions anywhere?
>>>>>>>>>
>>>>>>>>
>>>>>>>> No other exceptions that I can find. YARN apparently doesn't want
>>>>>>>> to aggregate spark's logs.
>>>>>>>>
>>>>>>>>
>>>>>>>>> Are all your Spark nodes set up to point to the same
>>>>>>>>> phoenix-client-spark JAR?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yes, as far as I can tell... though come to think of it, is that
>>>>>>>> jar shipped by the spark runtime to workers, or is it loaded locally on
>>>>>>>> each host? I only changed spark-defaults.conf on the client machine, the
>>>>>>>> machine from which I submitted the job.
>>>>>>>>
>>>>>>>> Thanks for taking a look Josh!
>>>>>>>>
>>>>>>>> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <nd...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi guys,
>>>>>>>>>>
>>>>>>>>>> I'm doing my best to follow along with [0], but I'm hitting some
>>>>>>>>>> stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My phoenix
>>>>>>>>>> build is much newer, basically 4.6-branch + PHOENIX-2503, PHOENIX-2568. I'm
>>>>>>>>>> using pyspark for now.
>>>>>>>>>>
>>>>>>>>>> I've added phoenix-$VERSION-client-spark.jar to both
>>>>>>>>>> spark.executor.extraClassPath and spark.driver.extraClassPath. This allows
>>>>>>>>>> me to use sqlContext.read to define a DataFrame against a Phoenix table.
>>>>>>>>>> This appears to basically work, as I see PhoenixInputFormat in the logs and
>>>>>>>>>> df.printSchema() shows me what I expect. However, when I try df.take(5), I
>>>>>>>>>> get "IllegalStateException: unread block data" [1] from the workers. Poking
>>>>>>>>>> around, this is commonly a problem with classpath. Any ideas as to
>>>>>>>>>> specifically which jars are needed? Or better still, how to debug this
>>>>>>>>>> issue myself. Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath
>>>>>>>>>> gives me a VerifyError about netty method version mismatch. Indeed I see
>>>>>>>>>> two netty versions in that lib directory...
>>>>>>>>>>
>>>>>>>>>> Thanks a lot,
>>>>>>>>>> -n
>>>>>>>>>>
>>>>>>>>>> [0]: http://phoenix.apache.org/phoenix_spark.html
>>>>>>>>>> [1]:
>>>>>>>>>>
>>>>>>>>>> java.lang.IllegalStateException: unread block data
>>>>>>>>>> at
>>>>>>>>>> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>>>>>>>>> at
>>>>>>>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>>>>>>>>> at
>>>>>>>>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>>>>>>>>> at
>>>>>>>>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>>>>>>>>> at
>>>>>>>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>>>>>> at
>>>>>>>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>>>>>> at
>>>>>>>>>> java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>>>>>>> at
>>>>>>>>>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>>>>>>>>> at
>>>>>>>>>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>>>>>>>>> at
>>>>>>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>>>>>>>>> at
>>>>>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>>>>>> at
>>>>>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <
>>>>>>>>>> jamestaylor@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for remembering about the docs, Josh.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <
>>>>>>>>>>> jmahonin@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Just an update for anyone interested, PHOENIX-2503 was just
>>>>>>>>>>>> committed for 4.7.0 and the docs have been updated to include these samples
>>>>>>>>>>>> for PySpark users.
>>>>>>>>>>>>
>>>>>>>>>>>> https://phoenix.apache.org/phoenix_spark.html
>>>>>>>>>>>>
>>>>>>>>>>>> Josh
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <
>>>>>>>>>>>> jmahonin@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey Nick,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think this used to work, and will again once PHOENIX-2503
>>>>>>>>>>>>> gets resolved. With the Spark DataFrame support, all the necessary glue is
>>>>>>>>>>>>> there for Phoenix and pyspark to play nice. With that client JAR (or by
>>>>>>>>>>>>> overriding the com.fasterxml.jackson JARS), you can do something like:
>>>>>>>>>>>>>
>>>>>>>>>>>>> df = sqlContext.read \
>>>>>>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>>>>>>   .load()
>>>>>>>>>>>>>
>>>>>>>>>>>>> And
>>>>>>>>>>>>>
>>>>>>>>>>>>> df.write \
>>>>>>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>>>>>>   .mode("overwrite") \
>>>>>>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>>>>>>   .save()
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, this should be added to the documentation. I hadn't
>>>>>>>>>>>>> actually tried this till just now. :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <
>>>>>>>>>>>>> ndimiduk@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Heya,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Has anyone any experience using phoenix-spark integration
>>>>>>>>>>>>>> from pyspark instead of scala? Folks prefer python around here...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I did find this example [0] of using HBaseOutputFormat from
>>>>>>>>>>>>>> pyspark, haven't tried extending it for phoenix. Maybe someone with more
>>>>>>>>>>>>>> experience in pyspark knows better? Would be a great addition to our
>>>>>>>>>>>>>> documentation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [0]:
>>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: phoenix-spark and pyspark

Posted by Nick Dimiduk <nd...@apache.org>.
I finally got to the bottom of things. There were two issues at play in my
particular environment.

1. An Ambari bug [0] means my spark-defaults.conf file was garbage. I
hardly thought of it when I hit the issue with MR job submission; its
impact on Spark was much more subtle.

2. YARN client version mismatch (Phoenix is compiled vs Apache 2.5.1 while
my cluster is running HDP's 2.7.1 build), per my earlier email. Once I'd
worked around (1), I was able to work around (2) by setting
"/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar"
for both spark.driver.extraClassPath and spark.executor.extraClassPath. Per
the linked thread above, I believe this places the correct YARN client
version first in the classpath.
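
For anyone following along at home, the resulting spark-defaults.conf entries
look like this (the paths come from my HDP layout, so adjust to your install):

spark.driver.extraClassPath   /usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
spark.executor.extraClassPath /usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar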

With the above in place, I'm able to submit work vs the Phoenix tables to
the YARN cluster. Success!

Ironically enough, I would not have been able to work around (2) if we
had PHOENIX-2535 in place. Food for thought in tackling that issue. It may
be worth while to ship a uberclient jar that is entirely without Hadoop (or
HBase) classes. I believe spark does this for Hadoop with their builds as
well.

Thanks again for your help here Josh! I really appreciate it.

[0]: https://issues.apache.org/jira/browse/AMBARI-14751

On Wed, Jan 20, 2016 at 2:23 PM, Nick Dimiduk <nd...@apache.org> wrote:

> Well, I spoke too soon. It's working, but in local mode only. When I
> invoke `pyspark --master yarn` (or yarn-client), the submitted application
> goes from ACCEPTED to FAILED, with a NumberFormatException [0] in my
> container log. Now that Phoenix is on my classpath, I'm suspicious that the
> versions of YARN client libraries are incompatible. I found an old thread
> [1] with the same stack trace I'm seeing, similar conclusion. I tried
> setting spark.driver.extraClassPath and spark.executor.extraClassPath
> to /usr/hdp/current/hadoop-yarn-client:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
> but that appears to have no impact.
>
> [0]:
> 16/01/20 22:03:45 ERROR yarn.ApplicationMaster: Uncaught exception:
> java.lang.IllegalArgumentException: Invalid ContainerId:
> container_e07_1452901320122_0042_01_000001
> at
> org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:182)
> at
> org.apache.spark.deploy.yarn.YarnRMClient.getAttemptId(YarnRMClient.scala:93)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:85)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:574)
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:572)
> at
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:599)
> at
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> Caused by: java.lang.NumberFormatException: For input string: "e07"
> at
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> at java.lang.Long.parseLong(Long.java:589)
> at java.lang.Long.parseLong(Long.java:631)
> at
> org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137)
> at
> org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177)
> ... 12 more
>
> [1]:
> http://mail-archives.us.apache.org/mod_mbox/spark-user/201503.mbox/%3CCAAqMD1jSEvfyw9oUBymhZukm7f+WTDVZ8E6Zp3L4a9OBJ-hz=A@mail.gmail.com%3E
>
> On Wed, Jan 20, 2016 at 1:29 PM, Josh Mahonin <jm...@gmail.com> wrote:
>
>> That's great to hear. Looking forward to the doc patch!
>>
>> On Wed, Jan 20, 2016 at 3:43 PM, Nick Dimiduk <nd...@apache.org>
>> wrote:
>>
>>> Josh -- I deployed my updated phoenix build across the cluster, added
>>> the phoenix-client-spark.jar to configs on the whole cluster, and now basic
>>> dataframe access is now working. Let me see about updating the docs page to
>>> be more clear, I'll send a patch by you for review.
>>>
>>> Thanks a lot for the help!
>>> -n
>>>
>>> On Tue, Jan 19, 2016 at 5:59 PM, Josh Mahonin <jm...@gmail.com>
>>> wrote:
>>>
>>>> Right, this cluster I just tested on is HDP 2.3.4, so it's Spark on
>>>> YARN as well. I suppose the JAR is probably shipped by YARN, though I don't
>>>> see any logging saying it, so I'm not certain how the nuts and bolts of
>>>> that work. By explicitly setting the classpath, we're bypassing Spark's
>>>> native JAR broadcast though.
>>>>
>>>> Taking a quick look at the config in Ambari (which ships the config to
>>>> each node after saving), in 'Custom spark-defaults' I have the following:
>>>>
>>>> spark.driver.extraClassPath ->
>>>> /etc/hbase/conf:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>>> spark.executor.extraClassPath ->
>>>> /usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>>>
>>>> I'm not sure if the /etc/hbase/conf is necessarily needed, but I think
>>>> that gets the Ambari generated hbase-site.xml in the classpath. Each node
>>>> has the custom phoenix-client-spark.jar installed to that same path as well.
>>>>
>>>> I can pop into regular spark-shell and load RDDs/DataFrames using:
>>>> /usr/hdp/current/spark-client/bin/spark-shell --master yarn-client
>>>>
>>>> or pyspark via:
>>>> /usr/hdp/current/spark-client/bin/pyspark
>>>>
>>>> I also do this as the Ambari-created 'spark' user, I think there was
>>>> some fun HDFS permission issue otherwise.
>>>>
>>>> On Tue, Jan 19, 2016 at 8:23 PM, Nick Dimiduk <nd...@apache.org>
>>>> wrote:
>>>>
>>>>> I'm using Spark on YARN, not spark stand-alone. YARN NodeManagers are
>>>>> colocated with RegionServers; all the hosts have everything. There are no
>>>>> spark workers to restart. You're sure it's not shipped by the YARN runtime?
>>>>>
>>>>> On Tue, Jan 19, 2016 at 5:07 PM, Josh Mahonin <jm...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Sadly, it needs to be installed onto each Spark worker (for now). The
>>>>>> executor config tells each Spark worker to look for that file to add to its
>>>>>> classpath, so once you have it installed, you'll probably need to restart
>>>>>> all the Spark workers.
>>>>>>
>>>>>> I co-locate Spark and HBase/Phoenix nodes, so I just drop it in
>>>>>> /usr/hdp/current/phoenix-client/, but anywhere that each worker can
>>>>>> consistently see is fine.
>>>>>>
>>>>>> One day we'll be able to have Spark ship the JAR around and use it
>>>>>> without this classpath nonsense, but we need to do some extra work on the
>>>>>> Phoenix side to make sure that Phoenix's calls to DriverManager actually go
>>>>>> through Spark's weird wrapper version of it.
>>>>>>
>>>>>> On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <nd...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <jm...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> What version of Spark are you using?
>>>>>>>>
>>>>>>>
>>>>>>> Probably HDP's Spark 1.4.1; that's what the jars in my install say,
>>>>>>> and the welcome message in the pyspark console agrees.
>>>>>>>
>>>>>>> Are there any other traces of exceptions anywhere?
>>>>>>>>
>>>>>>>
>>>>>>> No other exceptions that I can find. YARN apparently doesn't want to
>>>>>>> aggregate spark's logs.
>>>>>>>
>>>>>>>
>>>>>>>> Are all your Spark nodes set up to point to the same
>>>>>>>> phoenix-client-spark JAR?
>>>>>>>>
>>>>>>>
>>>>>>> Yes, as far as I can tell... though come to think of it, is that jar
>>>>>>> shipped by the spark runtime to workers, or is it loaded locally on each
>>>>>>> host? I only changed spark-defaults.conf on the client machine, the machine
>>>>>>> from which I submitted the job.
>>>>>>>
>>>>>>> Thanks for taking a look Josh!
>>>>>>>
>>>>>>> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <nd...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi guys,
>>>>>>>>>
>>>>>>>>> I'm doing my best to follow along with [0], but I'm hitting some
>>>>>>>>> stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My phoenix
>>>>>>>>> build is much newer, basically 4.6-branch + PHOENIX-2503, PHOENIX-2568. I'm
>>>>>>>>> using pyspark for now.
>>>>>>>>>
>>>>>>>>> I've added phoenix-$VERSION-client-spark.jar to both
>>>>>>>>> spark.executor.extraClassPath and spark.driver.extraClassPath. This allows
>>>>>>>>> me to use sqlContext.read to define a DataFrame against a Phoenix table.
>>>>>>>>> This appears to basically work, as I see PhoenixInputFormat in the logs and
>>>>>>>>> df.printSchema() shows me what I expect. However, when I try df.take(5), I
>>>>>>>>> get "IllegalStateException: unread block data" [1] from the workers. Poking
>>>>>>>>> around, this is commonly a problem with classpath. Any ideas as to
>>>>>>>>> specifically which jars are needed? Or better still, how to debug this
>>>>>>>>> issue myself. Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath
>>>>>>>>> gives me a VerifyError about netty method version mismatch. Indeed I see
>>>>>>>>> two netty versions in that lib directory...
>>>>>>>>>
>>>>>>>>> Thanks a lot,
>>>>>>>>> -n
>>>>>>>>>
>>>>>>>>> [0]: http://phoenix.apache.org/phoenix_spark.html
>>>>>>>>> [1]:
>>>>>>>>>
>>>>>>>>> java.lang.IllegalStateException: unread block data
>>>>>>>>> at
>>>>>>>>> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>>>>>>>> at
>>>>>>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>>>>>>>> at
>>>>>>>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>>>>>>>> at
>>>>>>>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>>>>>>>> at
>>>>>>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>>>>> at
>>>>>>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>>>>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>>>>>> at
>>>>>>>>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>>>>>>>> at
>>>>>>>>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>>>>>>>> at
>>>>>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>>>>>>>> at
>>>>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>>>>> at
>>>>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <
>>>>>>>>> jamestaylor@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for remembering about the docs, Josh.
>>>>>>>>>>
>>>>>>>>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <jmahonin@gmail.com
>>>>>>>>>> > wrote:
>>>>>>>>>>
>>>>>>>>>>> Just an update for anyone interested, PHOENIX-2503 was just
>>>>>>>>>>> committed for 4.7.0 and the docs have been updated to include these samples
>>>>>>>>>>> for PySpark users.
>>>>>>>>>>>
>>>>>>>>>>> https://phoenix.apache.org/phoenix_spark.html
>>>>>>>>>>>
>>>>>>>>>>> Josh
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <
>>>>>>>>>>> jmahonin@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Nick,
>>>>>>>>>>>>
>>>>>>>>>>>> I think this used to work, and will again once PHOENIX-2503
>>>>>>>>>>>> gets resolved. With the Spark DataFrame support, all the necessary glue is
>>>>>>>>>>>> there for Phoenix and pyspark to play nice. With that client JAR (or by
>>>>>>>>>>>> overriding the com.fasterxml.jackson JARS), you can do something like:
>>>>>>>>>>>>
>>>>>>>>>>>> df = sqlContext.read \
>>>>>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>>>>>   .load()
>>>>>>>>>>>>
>>>>>>>>>>>> And
>>>>>>>>>>>>
>>>>>>>>>>>> df.write \
>>>>>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>>>>>   .mode("overwrite") \
>>>>>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>>>>>   .save()
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, this should be added to the documentation. I hadn't
>>>>>>>>>>>> actually tried this till just now. :)
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <
>>>>>>>>>>>> ndimiduk@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Heya,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Has anyone any experience using phoenix-spark integration from
>>>>>>>>>>>>> pyspark instead of scala? Folks prefer python around here...
>>>>>>>>>>>>>
>>>>>>>>>>>>> I did find this example [0] of using HBaseOutputFormat from
>>>>>>>>>>>>> pyspark, haven't tried extending it for phoenix. Maybe someone with more
>>>>>>>>>>>>> experience in pyspark knows better? Would be a great addition to our
>>>>>>>>>>>>> documentation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>
>>>>>>>>>>>>> [0]:
>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: phoenix-spark and pyspark

Posted by Nick Dimiduk <nd...@apache.org>.
Well, I spoke too soon. It's working, but only in local mode. When I invoke
`pyspark --master yarn` (or yarn-client), the submitted application goes
from ACCEPTED to FAILED, with a NumberFormatException [0] in my container
log. Now that Phoenix is on my classpath, I suspect incompatible versions of
the YARN client libraries are being picked up. I found an old thread [1]
with the same stack trace I'm seeing and a similar conclusion. I tried
setting spark.driver.extraClassPath and spark.executor.extraClassPath
to /usr/hdp/current/hadoop-yarn-client:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar,
but that appears to have no impact.

[0]:
16/01/20 22:03:45 ERROR yarn.ApplicationMaster: Uncaught exception:
java.lang.IllegalArgumentException: Invalid ContainerId:
container_e07_1452901320122_0042_01_000001
at
org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:182)
at
org.apache.spark.deploy.yarn.YarnRMClient.getAttemptId(YarnRMClient.scala:93)
at
org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:85)
at
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:574)
at
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
at
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
at
org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:572)
at
org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:599)
at
org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
Caused by: java.lang.NumberFormatException: For input string: "e07"
at
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:589)
at java.lang.Long.parseLong(Long.java:631)
at
org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137)
at
org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177)
... 12 more

[1]:
http://mail-archives.us.apache.org/mod_mbox/spark-user/201503.mbox/%3CCAAqMD1jSEvfyw9oUBymhZukm7f+WTDVZ8E6Zp3L4a9OBJ-hz=A@mail.gmail.com%3E
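
For reference, the entries I'm setting in spark-defaults.conf on the client
machine amount to this (the same value for both keys, as above):

spark.driver.extraClassPath /usr/hdp/current/hadoop-yarn-client:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
spark.executor.extraClassPath /usr/hdp/current/hadoop-yarn-client:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar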

On Wed, Jan 20, 2016 at 1:29 PM, Josh Mahonin <jm...@gmail.com> wrote:

> That's great to hear. Looking forward to the doc patch!
>
> On Wed, Jan 20, 2016 at 3:43 PM, Nick Dimiduk <nd...@apache.org> wrote:
>
>> Josh -- I deployed my updated phoenix build across the cluster, added the
>> phoenix-client-spark.jar to the configs on the whole cluster, and now basic
>> dataframe access is working. Let me see about updating the docs page to
>> be clearer; I'll send a patch by you for review.
>>
>> Thanks a lot for the help!
>> -n
>>
>> On Tue, Jan 19, 2016 at 5:59 PM, Josh Mahonin <jm...@gmail.com> wrote:
>>
>>> Right, this cluster I just tested on is HDP 2.3.4, so it's Spark on YARN
>>> as well. I suppose the JAR is probably shipped by YARN, though I don't see
>>> any logging saying it, so I'm not certain how the nuts and bolts of that
>>> work. By explicitly setting the classpath, we're bypassing Spark's native
>>> JAR broadcast though.
>>>
>>> Taking a quick look at the config in Ambari (which ships the config to
>>> each node after saving), in 'Custom spark-defaults' I have the following:
>>>
>>> spark.driver.extraClassPath ->
>>> /etc/hbase/conf:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>> spark.executor.extraClassPath ->
>>> /usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>>
>>> I'm not sure if the /etc/hbase/conf is necessarily needed, but I think
>>> that gets the Ambari generated hbase-site.xml in the classpath. Each node
>>> has the custom phoenix-client-spark.jar installed to that same path as well.
>>>
>>> I can pop into regular spark-shell and load RDDs/DataFrames using:
>>> /usr/hdp/current/spark-client/bin/spark-shell --master yarn-client
>>>
>>> or pyspark via:
>>> /usr/hdp/current/spark-client/bin/pyspark
>>>
>>> I also do this as the Ambari-created 'spark' user, I think there was
>>> some fun HDFS permission issue otherwise.
>>>
>>> On Tue, Jan 19, 2016 at 8:23 PM, Nick Dimiduk <nd...@apache.org>
>>> wrote:
>>>
>>>> I'm using Spark on YARN, not spark stand-alone. YARN NodeManagers are
>>>> colocated with RegionServers; all the hosts have everything. There are no
>>>> spark workers to restart. You're sure it's not shipped by the YARN runtime?
>>>>
>>>> On Tue, Jan 19, 2016 at 5:07 PM, Josh Mahonin <jm...@gmail.com>
>>>> wrote:
>>>>
>>>>> Sadly, it needs to be installed onto each Spark worker (for now). The
>>>>> executor config tells each Spark worker to look for that file to add to its
>>>>> classpath, so once you have it installed, you'll probably need to restart
>>>>> all the Spark workers.
>>>>>
>>>>> I co-locate Spark and HBase/Phoenix nodes, so I just drop it in
>>>>> /usr/hdp/current/phoenix-client/, but anywhere that each worker can
>>>>> consistently see is fine.
>>>>>
>>>>> One day we'll be able to have Spark ship the JAR around and use it
>>>>> without this classpath nonsense, but we need to do some extra work on the
>>>>> Phoenix side to make sure that Phoenix's calls to DriverManager actually go
>>>>> through Spark's weird wrapper version of it.
>>>>>
>>>>> On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <nd...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <jm...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> What version of Spark are you using?
>>>>>>>
>>>>>>
>>>>>> Probably HDP's Spark 1.4.1; that's what the jars in my install say,
>>>>>> and the welcome message in the pyspark console agrees.
>>>>>>
>>>>>> Are there any other traces of exceptions anywhere?
>>>>>>>
>>>>>>
>>>>>> No other exceptions that I can find. YARN apparently doesn't want to
>>>>>> aggregate spark's logs.
>>>>>>
>>>>>>
>>>>>>> Are all your Spark nodes set up to point to the same
>>>>>>> phoenix-client-spark JAR?
>>>>>>>
>>>>>>
>>>>>> Yes, as far as I can tell... though come to think of it, is that jar
>>>>>> shipped by the spark runtime to workers, or is it loaded locally on each
>>>>>> host? I only changed spark-defaults.conf on the client machine, the machine
>>>>>> from which I submitted the job.
>>>>>>
>>>>>> Thanks for taking a look Josh!
>>>>>>
>>>>>> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <nd...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> I'm doing my best to follow along with [0], but I'm hitting some
>>>>>>>> stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My phoenix
>>>>>>>> build is much newer, basically 4.6-branch + PHOENIX-2503, PHOENIX-2568. I'm
>>>>>>>> using pyspark for now.
>>>>>>>>
>>>>>>>> I've added phoenix-$VERSION-client-spark.jar to both
>>>>>>>> spark.executor.extraClassPath and spark.driver.extraClassPath. This allows
>>>>>>>> me to use sqlContext.read to define a DataFrame against a Phoenix table.
>>>>>>>> This appears to basically work, as I see PhoenixInputFormat in the logs and
>>>>>>>> df.printSchema() shows me what I expect. However, when I try df.take(5), I
>>>>>>>> get "IllegalStateException: unread block data" [1] from the workers. Poking
>>>>>>>> around, this is commonly a problem with classpath. Any ideas as to
>>>>>>>> specifically which jars are needed? Or better still, how to debug this
>>>>>>>> issue myself. Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath
>>>>>>>> gives me a VerifyError about netty method version mismatch. Indeed I see
>>>>>>>> two netty versions in that lib directory...
>>>>>>>>
>>>>>>>> Thanks a lot,
>>>>>>>> -n
>>>>>>>>
>>>>>>>> [0]: http://phoenix.apache.org/phoenix_spark.html
>>>>>>>> [1]:
>>>>>>>>
>>>>>>>> java.lang.IllegalStateException: unread block data
>>>>>>>> at
>>>>>>>> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>>>>>>> at
>>>>>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>>>>>>> at
>>>>>>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>>>>>>> at
>>>>>>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>>>>>>> at
>>>>>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>>>> at
>>>>>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>>>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>>>>> at
>>>>>>>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>>>>>>> at
>>>>>>>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>>>>>>> at
>>>>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>>>>>>> at
>>>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>>>> at
>>>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <
>>>>>>>> jamestaylor@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Thanks for remembering about the docs, Josh.
>>>>>>>>>
>>>>>>>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <jm...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Just an update for anyone interested, PHOENIX-2503 was just
>>>>>>>>>> committed for 4.7.0 and the docs have been updated to include these samples
>>>>>>>>>> for PySpark users.
>>>>>>>>>>
>>>>>>>>>> https://phoenix.apache.org/phoenix_spark.html
>>>>>>>>>>
>>>>>>>>>> Josh
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <jmahonin@gmail.com
>>>>>>>>>> > wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Nick,
>>>>>>>>>>>
>>>>>>>>>>> I think this used to work, and will again once PHOENIX-2503 gets
>>>>>>>>>>> resolved. With the Spark DataFrame support, all the necessary glue is there
>>>>>>>>>>> for Phoenix and pyspark to play nice. With that client JAR (or by
>>>>>>>>>>> overriding the com.fasterxml.jackson JARS), you can do something like:
>>>>>>>>>>>
>>>>>>>>>>> df = sqlContext.read \
>>>>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>>>>   .load()
>>>>>>>>>>>
>>>>>>>>>>> And
>>>>>>>>>>>
>>>>>>>>>>> df.write \
>>>>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>>>>   .mode("overwrite") \
>>>>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>>>>   .save()
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Yes, this should be added to the documentation. I hadn't
>>>>>>>>>>> actually tried this till just now. :)
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <
>>>>>>>>>>> ndimiduk@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Heya,
>>>>>>>>>>>>
>>>>>>>>>>>> Has anyone any experience using phoenix-spark integration from
>>>>>>>>>>>> pyspark instead of scala? Folks prefer python around here...
>>>>>>>>>>>>
>>>>>>>>>>>> I did find this example [0] of using HBaseOutputFormat from
>>>>>>>>>>>> pyspark, haven't tried extending it for phoenix. Maybe someone with more
>>>>>>>>>>>> experience in pyspark knows better? Would be a great addition to our
>>>>>>>>>>>> documentation.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Nick
>>>>>>>>>>>>
>>>>>>>>>>>> [0]:
>>>>>>>>>>>> https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: phoenix-spark and pyspark

Posted by Josh Mahonin <jm...@gmail.com>.
That's great to hear. Looking forward to the doc patch!

On Wed, Jan 20, 2016 at 3:43 PM, Nick Dimiduk <nd...@apache.org> wrote:

> Josh -- I deployed my updated phoenix build across the cluster, added the
> phoenix-client-spark.jar to the configs on the whole cluster, and now basic
> dataframe access is working. Let me see about updating the docs page to
> be clearer; I'll send a patch by you for review.
>
> Thanks a lot for the help!
> -n
>
> On Tue, Jan 19, 2016 at 5:59 PM, Josh Mahonin <jm...@gmail.com> wrote:
>
>> Right, this cluster I just tested on is HDP 2.3.4, so it's Spark on YARN
>> as well. I suppose the JAR is probably shipped by YARN, though I don't see
>> any logging saying it, so I'm not certain how the nuts and bolts of that
>> work. By explicitly setting the classpath, we're bypassing Spark's native
>> JAR broadcast though.
>>
>> Taking a quick look at the config in Ambari (which ships the config to
>> each node after saving), in 'Custom spark-defaults' I have the following:
>>
>> spark.driver.extraClassPath ->
>> /etc/hbase/conf:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>> spark.executor.extraClassPath ->
>> /usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>
>> I'm not sure if the /etc/hbase/conf is necessarily needed, but I think
>> that gets the Ambari generated hbase-site.xml in the classpath. Each node
>> has the custom phoenix-client-spark.jar installed to that same path as well.
>>
>> I can pop into regular spark-shell and load RDDs/DataFrames using:
>> /usr/hdp/current/spark-client/bin/spark-shell --master yarn-client
>>
>> or pyspark via:
>> /usr/hdp/current/spark-client/bin/pyspark
>>
>> I also do this as the Ambari-created 'spark' user, I think there was some
>> fun HDFS permission issue otherwise.
>>
>> On Tue, Jan 19, 2016 at 8:23 PM, Nick Dimiduk <nd...@apache.org>
>> wrote:
>>
>>> I'm using Spark on YARN, not spark stand-alone. YARN NodeManagers are
>>> colocated with RegionServers; all the hosts have everything. There are no
>>> spark workers to restart. You're sure it's not shipped by the YARN runtime?
>>>
>>> On Tue, Jan 19, 2016 at 5:07 PM, Josh Mahonin <jm...@gmail.com>
>>> wrote:
>>>
>>>> Sadly, it needs to be installed onto each Spark worker (for now). The
>>>> executor config tells each Spark worker to look for that file to add to its
>>>> classpath, so once you have it installed, you'll probably need to restart
>>>> all the Spark workers.
>>>>
>>>> I co-locate Spark and HBase/Phoenix nodes, so I just drop it in
>>>> /usr/hdp/current/phoenix-client/, but anywhere that each worker can
>>>> consistently see is fine.
>>>>
>>>> One day we'll be able to have Spark ship the JAR around and use it
>>>> without this classpath nonsense, but we need to do some extra work on the
>>>> Phoenix side to make sure that Phoenix's calls to DriverManager actually go
>>>> through Spark's weird wrapper version of it.
>>>>
>>>> On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <nd...@apache.org>
>>>> wrote:
>>>>
>>>>> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <jm...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> What version of Spark are you using?
>>>>>>
>>>>>
>>>>> Probably HDP's Spark 1.4.1; that's what the jars in my install say,
>>>>> and the welcome message in the pyspark console agrees.
>>>>>
>>>>> Are there any other traces of exceptions anywhere?
>>>>>>
>>>>>
>>>>> No other exceptions that I can find. YARN apparently doesn't want to
>>>>> aggregate spark's logs.
>>>>>
>>>>>
>>>>>> Are all your Spark nodes set up to point to the same
>>>>>> phoenix-client-spark JAR?
>>>>>>
>>>>>
>>>>> Yes, as far as I can tell... though come to think of it, is that jar
>>>>> shipped by the spark runtime to workers, or is it loaded locally on each
>>>>> host? I only changed spark-defaults.conf on the client machine, the machine
>>>>> from which I submitted the job.
>>>>>
>>>>> Thanks for taking a look Josh!
>>>>>
>>>>> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <nd...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>>
>>>>>>> I'm doing my best to follow along with [0], but I'm hitting some
>>>>>>> stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My phoenix
>>>>>>> build is much newer, basically 4.6-branch + PHOENIX-2503, PHOENIX-2568. I'm
>>>>>>> using pyspark for now.
>>>>>>>
>>>>>>> I've added phoenix-$VERSION-client-spark.jar to both
>>>>>>> spark.executor.extraClassPath and spark.driver.extraClassPath. This allows
>>>>>>> me to use sqlContext.read to define a DataFrame against a Phoenix table.
>>>>>>> This appears to basically work, as I see PhoenixInputFormat in the logs and
>>>>>>> df.printSchema() shows me what I expect. However, when I try df.take(5), I
>>>>>>> get "IllegalStateException: unread block data" [1] from the workers. Poking
>>>>>>> around, this is commonly a problem with classpath. Any ideas as to
>>>>>>> specifically which jars are needed? Or better still, how to debug this
>>>>>>> issue myself. Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath
>>>>>>> gives me a VerifyError about netty method version mismatch. Indeed I see
>>>>>>> two netty versions in that lib directory...
>>>>>>>
>>>>>>> Thanks a lot,
>>>>>>> -n
>>>>>>>
>>>>>>> [0]: http://phoenix.apache.org/phoenix_spark.html
>>>>>>> [1]:
>>>>>>>
>>>>>>> java.lang.IllegalStateException: unread block data
>>>>>>> at
>>>>>>> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>>>>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>>>>>> at
>>>>>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>>>>>> at
>>>>>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>>>>>> at
>>>>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>>>> at
>>>>>>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>>>>>> at
>>>>>>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>>>>>> at
>>>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>>>>>> at
>>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>>> at
>>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <
>>>>>>> jamestaylor@apache.org> wrote:
>>>>>>>
>>>>>>>> Thanks for remembering about the docs, Josh.
>>>>>>>>
>>>>>>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <jm...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Just an update for anyone interested, PHOENIX-2503 was just
>>>>>>>>> committed for 4.7.0 and the docs have been updated to include these samples
>>>>>>>>> for PySpark users.
>>>>>>>>>
>>>>>>>>> https://phoenix.apache.org/phoenix_spark.html
>>>>>>>>>
>>>>>>>>> Josh
>>>>>>>>>
>>>>>>>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <jm...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Nick,
>>>>>>>>>>
>>>>>>>>>> I think this used to work, and will again once PHOENIX-2503 gets
>>>>>>>>>> resolved. With the Spark DataFrame support, all the necessary glue is there
>>>>>>>>>> for Phoenix and pyspark to play nice. With that client JAR (or by
>>>>>>>>>> overriding the com.fasterxml.jackson JARS), you can do something like:
>>>>>>>>>>
>>>>>>>>>> df = sqlContext.read \
>>>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>>>   .load()
>>>>>>>>>>
>>>>>>>>>> And
>>>>>>>>>>
>>>>>>>>>> df.write \
>>>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>>>   .mode("overwrite") \
>>>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>>>   .save()
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes, this should be added to the documentation. I hadn't actually
>>>>>>>>>> tried this till just now. :)
>>>>>>>>>>
>>>>>>>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <ndimiduk@apache.org
>>>>>>>>>> > wrote:
>>>>>>>>>>
>>>>>>>>>>> Heya,
>>>>>>>>>>>
>>>>>>>>>>> Has anyone any experience using phoenix-spark integration from
>>>>>>>>>>> pyspark instead of scala? Folks prefer python around here...
>>>>>>>>>>>
>>>>>>>>>>> I did find this example [0] of using HBaseOutputFormat from
>>>>>>>>>>> pyspark, haven't tried extending it for phoenix. Maybe someone with more
>>>>>>>>>>> experience in pyspark knows better? Would be a great addition to our
>>>>>>>>>>> documentation.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Nick
>>>>>>>>>>>
>>>>>>>>>>> [0]:
>>>>>>>>>>> https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: phoenix-spark and pyspark

Posted by Nick Dimiduk <nd...@apache.org>.
Josh -- I deployed my updated phoenix build across the cluster, added the
phoenix-client-spark.jar to the configs on the whole cluster, and now basic
dataframe access is working. Let me see about updating the docs page to
be clearer; I'll send a patch by you for review.

Thanks a lot for the help!
-n

On Tue, Jan 19, 2016 at 5:59 PM, Josh Mahonin <jm...@gmail.com> wrote:

> Right, this cluster I just tested on is HDP 2.3.4, so it's Spark on YARN
> as well. I suppose the JAR is probably shipped by YARN, though I don't see
> any logging saying it, so I'm not certain how the nuts and bolts of that
> work. By explicitly setting the classpath, we're bypassing Spark's native
> JAR broadcast though.
>
> Taking a quick look at the config in Ambari (which ships the config to
> each node after saving), in 'Custom spark-defaults' I have the following:
>
> spark.driver.extraClassPath ->
> /etc/hbase/conf:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
> spark.executor.extraClassPath ->
> /usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>
> I'm not sure if the /etc/hbase/conf is necessarily needed, but I think
> that gets the Ambari generated hbase-site.xml in the classpath. Each node
> has the custom phoenix-client-spark.jar installed to that same path as well.
>
> I can pop into regular spark-shell and load RDDs/DataFrames using:
> /usr/hdp/current/spark-client/bin/spark-shell --master yarn-client
>
> or pyspark via:
> /usr/hdp/current/spark-client/bin/pyspark
>
> I also do this as the Ambari-created 'spark' user, I think there was some
> fun HDFS permission issue otherwise.
>
> On Tue, Jan 19, 2016 at 8:23 PM, Nick Dimiduk <nd...@apache.org> wrote:
>
>> I'm using Spark on YARN, not spark stand-alone. YARN NodeManagers are
>> colocated with RegionServers; all the hosts have everything. There are no
>> spark workers to restart. You're sure it's not shipped by the YARN runtime?
>>
>> On Tue, Jan 19, 2016 at 5:07 PM, Josh Mahonin <jm...@gmail.com> wrote:
>>
>>> Sadly, it needs to be installed onto each Spark worker (for now). The
>>> executor config tells each Spark worker to look for that file to add to its
>>> classpath, so once you have it installed, you'll probably need to restart
>>> all the Spark workers.
>>>
>>> I co-locate Spark and HBase/Phoenix nodes, so I just drop it in
>>> /usr/hdp/current/phoenix-client/, but anywhere that each worker can
>>> consistently see is fine.
>>>
>>> One day we'll be able to have Spark ship the JAR around and use it
>>> without this classpath nonsense, but we need to do some extra work on the
>>> Phoenix side to make sure that Phoenix's calls to DriverManager actually go
>>> through Spark's weird wrapper version of it.
>>>
>>> On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <nd...@apache.org>
>>> wrote:
>>>
>>>> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <jm...@gmail.com>
>>>> wrote:
>>>>
>>>>> What version of Spark are you using?
>>>>>
>>>>
>>>> Probably HDP's Spark 1.4.1; that's what the jars in my install say, and
>>>> the welcome message in the pyspark console agrees.
>>>>
>>>> Are there any other traces of exceptions anywhere?
>>>>>
>>>>
>>>> No other exceptions that I can find. YARN apparently doesn't want to
>>>> aggregate spark's logs.
>>>>
>>>>
>>>>> Are all your Spark nodes set up to point to the same
>>>>> phoenix-client-spark JAR?
>>>>>
>>>>
>>>> Yes, as far as I can tell... though come to think of it, is that jar
>>>> shipped by the spark runtime to workers, or is it loaded locally on each
>>>> host? I only changed spark-defaults.conf on the client machine, the machine
>>>> from which I submitted the job.
>>>>
>>>> Thanks for taking a look Josh!
>>>>
>>>> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <nd...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I'm doing my best to follow along with [0], but I'm hitting some
>>>>>> stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My phoenix
>>>>>> build is much newer, basically 4.6-branch + PHOENIX-2503, PHOENIX-2568. I'm
>>>>>> using pyspark for now.
>>>>>>
>>>>>> I've added phoenix-$VERSION-client-spark.jar to both
>>>>>> spark.executor.extraClassPath and spark.driver.extraClassPath. This allows
>>>>>> me to use sqlContext.read to define a DataFrame against a Phoenix table.
>>>>>> This appears to basically work, as I see PhoenixInputFormat in the logs and
>>>>>> df.printSchema() shows me what I expect. However, when I try df.take(5), I
>>>>>> get "IllegalStateException: unread block data" [1] from the workers. Poking
>>>>>> around, this is commonly a problem with classpath. Any ideas as to
>>>>>> specifically which jars are needed? Or better still, how to debug this
>>>>>> issue myself. Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath
>>>>>> gives me a VerifyError about netty method version mismatch. Indeed I see
>>>>>> two netty versions in that lib directory...
>>>>>>
>>>>>> Thanks a lot,
>>>>>> -n
>>>>>>
>>>>>> [0]: http://phoenix.apache.org/phoenix_spark.html
>>>>>> [1]:
>>>>>>
>>>>>> java.lang.IllegalStateException: unread block data
>>>>>> at
>>>>>> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>>>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>>>>> at
>>>>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>>>>> at
>>>>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>>>>> at
>>>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>>> at
>>>>>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>>>>> at
>>>>>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>>>>> at
>>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>>>>> at
>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>> at
>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>>
>>>>>>
>>>>>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <jamestaylor@apache.org
>>>>>> > wrote:
>>>>>>
>>>>>>> Thanks for remembering about the docs, Josh.
>>>>>>>
>>>>>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <jm...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Just an update for anyone interested, PHOENIX-2503 was just
>>>>>>>> committed for 4.7.0 and the docs have been updated to include these samples
>>>>>>>> for PySpark users.
>>>>>>>>
>>>>>>>> https://phoenix.apache.org/phoenix_spark.html
>>>>>>>>
>>>>>>>> Josh
>>>>>>>>
>>>>>>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <jm...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey Nick,
>>>>>>>>>
>>>>>>>>> I think this used to work, and will again once PHOENIX-2503 gets
>>>>>>>>> resolved. With the Spark DataFrame support, all the necessary glue is there
>>>>>>>>> for Phoenix and pyspark to play nice. With that client JAR (or by
>>>>>>>>> overriding the com.fasterxml.jackson JARS), you can do something like:
>>>>>>>>>
>>>>>>>>> df = sqlContext.read \
>>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>>   .load()
>>>>>>>>>
>>>>>>>>> And
>>>>>>>>>
>>>>>>>>> df.write \
>>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>>   .mode("overwrite") \
>>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>>   .save()
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes, this should be added to the documentation. I hadn't actually
>>>>>>>>> tried this till just now. :)
>>>>>>>>>
>>>>>>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <nd...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Heya,
>>>>>>>>>>
>>>>>>>>>> Has anyone any experience using phoenix-spark integration from
>>>>>>>>>> pyspark instead of scala? Folks prefer python around here...
>>>>>>>>>>
>>>>>>>>>> I did find this example [0] of using HBaseOutputFormat from
>>>>>>>>>> pyspark, haven't tried extending it for phoenix. Maybe someone with more
>>>>>>>>>> experience in pyspark knows better? Would be a great addition to our
>>>>>>>>>> documentation.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Nick
>>>>>>>>>>
>>>>>>>>>> [0]:
>>>>>>>>>> https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: phoenix-spark and pyspark

Posted by Josh Mahonin <jm...@gmail.com>.
Right, the cluster I just tested on is HDP 2.3.4, so it's Spark on YARN as
well. I suppose the JAR is shipped by YARN, though I don't see any logging
that says so, so I'm not certain how the nuts and bolts of that work. By
explicitly setting the classpath, we're bypassing Spark's native JAR
broadcast, though.

Taking a quick look at the config in Ambari (which ships the config to each
node after saving), in 'Custom spark-defaults' I have the following:

spark.driver.extraClassPath ->
/etc/hbase/conf:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
spark.executor.extraClassPath ->
/usr/hdp/current/phoenix-client/phoenix-client-spark.jar

I'm not sure if /etc/hbase/conf is strictly needed, but I think that gets
the Ambari-generated hbase-site.xml onto the classpath. Each node has the
custom phoenix-client-spark.jar installed at that same path as well.

I can pop into regular spark-shell and load RDDs/DataFrames using:
/usr/hdp/current/spark-client/bin/spark-shell --master yarn-client

or pyspark via:
/usr/hdp/current/spark-client/bin/pyspark

I also do this as the Ambari-created 'spark' user; I think there was some
fun HDFS permission issue otherwise.
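
Once that's in place, a quick sanity check from pyspark looks something like
this (the table name and ZK quorum below are just placeholders):

df = sqlContext.read \
  .format("org.apache.phoenix.spark") \
  .option("table", "TABLE1") \
  .option("zkUrl", "your-zk-host:2181") \
  .load()
df.printSchema()
df.take(5)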

On Tue, Jan 19, 2016 at 8:23 PM, Nick Dimiduk <nd...@apache.org> wrote:

> I'm using Spark on YARN, not spark stand-alone. YARN NodeManagers are
> colocated with RegionServers; all the hosts have everything. There are no
> spark workers to restart. You're sure it's not shipped by the YARN runtime?
>
> On Tue, Jan 19, 2016 at 5:07 PM, Josh Mahonin <jm...@gmail.com> wrote:
>
>> Sadly, it needs to be installed onto each Spark worker (for now). The
>> executor config tells each Spark worker to look for that file to add to its
>> classpath, so once you have it installed, you'll probably need to restart
>> all the Spark workers.
>>
>> I co-locate Spark and HBase/Phoenix nodes, so I just drop it in
>> /usr/hdp/current/phoenix-client/, but anywhere that each worker can
>> consistently see is fine.
>>
>> One day we'll be able to have Spark ship the JAR around and use it
>> without this classpath nonsense, but we need to do some extra work on the
>> Phoenix side to make sure that Phoenix's calls to DriverManager actually go
>> through Spark's weird wrapper version of it.
>>
>> On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <nd...@apache.org>
>> wrote:
>>
>>> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <jm...@gmail.com>
>>> wrote:
>>>
>>>> What version of Spark are you using?
>>>>
>>>
>>> Probably HDP's Spark 1.4.1; that's what the jars in my install say, and
>>> the welcome message in the pyspark console agrees.
>>>
>>> Are there any other traces of exceptions anywhere?
>>>>
>>>
>>> No other exceptions that I can find. YARN apparently doesn't want to
>>> aggregate spark's logs.
>>>
>>>
>>>> Are all your Spark nodes set up to point to the same
>>>> phoenix-client-spark JAR?
>>>>
>>>
>>> Yes, as far as I can tell... though come to think of it, is that jar
>>> shipped by the spark runtime to workers, or is it loaded locally on each
>>> host? I only changed spark-defaults.conf on the client machine, the machine
>>> from which I submitted the job.
>>>
>>> Thanks for taking a look Josh!
>>>
>>> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <nd...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> I'm doing my best to follow along with [0], but I'm hitting some
>>>>> stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My phoenix
>>>>> build is much newer, basically 4.6-branch + PHOENIX-2503, PHOENIX-2568. I'm
>>>>> using pyspark for now.
>>>>>
>>>>> I've added phoenix-$VERSION-client-spark.jar to both
>>>>> spark.executor.extraClassPath and spark.driver.extraClassPath. This allows
>>>>> me to use sqlContext.read to define a DataFrame against a Phoenix table.
>>>>> This appears to basically work, as I see PhoenixInputFormat in the logs and
>>>>> df.printSchema() shows me what I expect. However, when I try df.take(5), I
>>>>> get "IllegalStateException: unread block data" [1] from the workers. Poking
>>>>> around, this is commonly a problem with classpath. Any ideas as to
>>>>> specifically which jars are needed? Or better still, how to debug this
>>>>> issue myself. Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath
>>>>> gives me a VerifyError about netty method version mismatch. Indeed I see
>>>>> two netty versions in that lib directory...
>>>>>
>>>>> Thanks a lot,
>>>>> -n
>>>>>
>>>>> [0]: http://phoenix.apache.org/phoenix_spark.html
>>>>> [1]:
>>>>>
>>>>> java.lang.IllegalStateException: unread block data
>>>>> at
>>>>> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>>>> at
>>>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>>>> at
>>>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>>>> at
>>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>> at
>>>>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>>>> at
>>>>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>>>> at
>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>>>> at
>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>> at
>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>
>>>>>
>>>>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <ja...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Thanks for remembering about the docs, Josh.
>>>>>>
>>>>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <jm...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Just an update for anyone interested, PHOENIX-2503 was just
>>>>>>> committed for 4.7.0 and the docs have been updated to include these samples
>>>>>>> for PySpark users.
>>>>>>>
>>>>>>> https://phoenix.apache.org/phoenix_spark.html
>>>>>>>
>>>>>>> Josh
>>>>>>>
>>>>>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <jm...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Nick,
>>>>>>>>
>>>>>>>> I think this used to work, and will again once PHOENIX-2503 gets
>>>>>>>> resolved. With the Spark DataFrame support, all the necessary glue is there
>>>>>>>> for Phoenix and pyspark to play nice. With that client JAR (or by
>>>>>>>> overriding the com.fasterxml.jackson JARS), you can do something like:
>>>>>>>>
>>>>>>>> df = sqlContext.read \
>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>   .load()
>>>>>>>>
>>>>>>>> And
>>>>>>>>
>>>>>>>> df.write \
>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>   .mode("overwrite") \
>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>   .save()
>>>>>>>>
>>>>>>>>
>>>>>>>> Yes, this should be added to the documentation. I hadn't actually
>>>>>>>> tried this till just now. :)
>>>>>>>>
>>>>>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <nd...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Heya,
>>>>>>>>>
>>>>>>>>> Has anyone any experience using phoenix-spark integration from
>>>>>>>>> pyspark instead of scala? Folks prefer python around here...
>>>>>>>>>
>>>>>>>>> I did find this example [0] of using HBaseOutputFormat from
>>>>>>>>> pyspark, haven't tried extending it for phoenix. Maybe someone with more
>>>>>>>>> experience in pyspark knows better? Would be a great addition to our
>>>>>>>>> documentation.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Nick
>>>>>>>>>
>>>>>>>>> [0]:
>>>>>>>>> https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: phoenix-spark and pyspark

Posted by Nick Dimiduk <nd...@apache.org>.
I'm using Spark on YARN, not spark stand-alone. YARN NodeManagers are
colocated with RegionServers; all the hosts have everything. There are no
spark workers to restart. You're sure it's not shipped by the YARN runtime?

On Tue, Jan 19, 2016 at 5:07 PM, Josh Mahonin <jm...@gmail.com> wrote:

> Sadly, it needs to be installed onto each Spark worker (for now). The
> executor config tells each Spark worker to look for that file to add to its
> classpath, so once you have it installed, you'll probably need to restart
> all the Spark workers.
>
> I co-locate Spark and HBase/Phoenix nodes, so I just drop it in
> /usr/hdp/current/phoenix-client/, but anywhere that each worker can
> consistently see is fine.
>
> One day we'll be able to have Spark ship the JAR around and use it without
> this classpath nonsense, but we need to do some extra work on the Phoenix
> side to make sure that Phoenix's calls to DriverManager actually go through
> Spark's weird wrapper version of it.
>
> On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <nd...@apache.org> wrote:
>
>> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <jm...@gmail.com> wrote:
>>
>>> What version of Spark are you using?
>>>
>>
>> Probably HDP's Spark 1.4.1; that's what the jars in my install say, and
>> the welcome message in the pyspark console agrees.
>>
>> Are there any other traces of exceptions anywhere?
>>>
>>
>> No other exceptions that I can find. YARN apparently doesn't want to
>> aggregate spark's logs.
>>
>>
>>> Are all your Spark nodes set up to point to the same
>>> phoenix-client-spark JAR?
>>>
>>
>> Yes, as far as I can tell... though come to think of it, is that jar
>> shipped by the spark runtime to workers, or is it loaded locally on each
>> host? I only changed spark-defaults.conf on the client machine, the machine
>> from which I submitted the job.
>>
>> Thanks for taking a look Josh!
>>
>> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <nd...@apache.org>
>>> wrote:
>>>
>>>> Hi guys,
>>>>
>>>> I'm doing my best to follow along with [0], but I'm hitting some
>>>> stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My phoenix
>>>> build is much newer, basically 4.6-branch + PHOENIX-2503, PHOENIX-2568. I'm
>>>> using pyspark for now.
>>>>
>>>> I've added phoenix-$VERSION-client-spark.jar to both
>>>> spark.executor.extraClassPath and spark.driver.extraClassPath. This allows
>>>> me to use sqlContext.read to define a DataFrame against a Phoenix table.
>>>> This appears to basically work, as I see PhoenixInputFormat in the logs and
>>>> df.printSchema() shows me what I expect. However, when I try df.take(5), I
>>>> get "IllegalStateException: unread block data" [1] from the workers. Poking
>>>> around, this is commonly a problem with classpath. Any ideas as to
>>>> specifically which jars are needed? Or better still, how to debug this
>>>> issue myself. Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath
>>>> gives me a VerifyError about netty method version mismatch. Indeed I see
>>>> two netty versions in that lib directory...
>>>>
>>>> Thanks a lot,
>>>> -n
>>>>
>>>> [0]: http://phoenix.apache.org/phoenix_spark.html
>>>> [1]:
>>>>
>>>> java.lang.IllegalStateException: unread block data
>>>> at
>>>> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>>> at
>>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>>> at
>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>> at
>>>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>>> at
>>>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>>> at
>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>> at
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>> at java.lang.Thread.run(Thread.java:745)
>>>>
>>>>
>>>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <ja...@apache.org>
>>>> wrote:
>>>>
>>>>> Thanks for remembering about the docs, Josh.
>>>>>
>>>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <jm...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Just an update for anyone interested, PHOENIX-2503 was just committed
>>>>>> for 4.7.0 and the docs have been updated to include these samples for
>>>>>> PySpark users.
>>>>>>
>>>>>> https://phoenix.apache.org/phoenix_spark.html
>>>>>>
>>>>>> Josh
>>>>>>
>>>>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <jm...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey Nick,
>>>>>>>
>>>>>>> I think this used to work, and will again once PHOENIX-2503 gets
>>>>>>> resolved. With the Spark DataFrame support, all the necessary glue is there
>>>>>>> for Phoenix and pyspark to play nice. With that client JAR (or by
>>>>>>> overriding the com.fasterxml.jackson JARS), you can do something like:
>>>>>>>
>>>>>>> df = sqlContext.read \
>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>   .option("table", "TABLE1") \
>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>   .load()
>>>>>>>
>>>>>>> And
>>>>>>>
>>>>>>> df.write \
>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>   .mode("overwrite") \
>>>>>>>   .option("table", "TABLE1") \
>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>   .save()
>>>>>>>
>>>>>>>
>>>>>>> Yes, this should be added to the documentation. I hadn't actually
>>>>>>> tried this till just now. :)
>>>>>>>
>>>>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <nd...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Heya,
>>>>>>>>
>>>>>>>> Has anyone any experience using phoenix-spark integration from
>>>>>>>> pyspark instead of scala? Folks prefer python around here...
>>>>>>>>
>>>>>>>> I did find this example [0] of using HBaseOutputFormat from
>>>>>>>> pyspark, haven't tried extending it for phoenix. Maybe someone with more
>>>>>>>> experience in pyspark knows better? Would be a great addition to our
>>>>>>>> documentation.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Nick
>>>>>>>>
>>>>>>>> [0]:
>>>>>>>> https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: phoenix-spark and pyspark

Posted by Josh Mahonin <jm...@gmail.com>.
Sadly, it needs to be installed onto each Spark worker (for now). The
executor config tells each Spark worker to look for that file to add to its
classpath, so once you have it installed, you'll probably need to restart
all the Spark workers.

I co-locate Spark and HBase/Phoenix nodes, so I just drop it in
/usr/hdp/current/phoenix-client/, but any path that every worker can
consistently see is fine.
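
Concretely, with the JAR dropped in that location, the relevant
spark-defaults.conf entry is just (path assumes the HDP layout):

spark.executor.extraClassPath /usr/hdp/current/phoenix-client/phoenix-client-spark.jar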

One day we'll be able to have Spark ship the JAR around and use it without
this classpath nonsense, but we need to do some extra work on the Phoenix
side to make sure that Phoenix's calls to DriverManager actually go through
Spark's weird wrapper version of it.

On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <nd...@apache.org> wrote:

> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <jm...@gmail.com> wrote:
>
>> What version of Spark are you using?
>>
>
> Probably HDP's Spark 1.4.1; that's what the jars in my install say, and
> the welcome message in the pyspark console agrees.
>
> Are there any other traces of exceptions anywhere?
>>
>
> No other exceptions that I can find. YARN apparently doesn't want to
> aggregate spark's logs.
>
>
>> Are all your Spark nodes set up to point to the same phoenix-client-spark
>> JAR?
>>
>
> Yes, as far as I can tell... though come to think of it, is that jar
> shipped by the spark runtime to workers, or is it loaded locally on each
> host? I only changed spark-defaults.conf on the client machine, the machine
> from which I submitted the job.
>
> Thanks for taking a look Josh!
>
> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <nd...@apache.org> wrote:
>>
>>> Hi guys,
>>>
>>> I'm doing my best to follow along with [0], but I'm hitting some
>>> stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My phoenix
>>> build is much newer, basically 4.6-branch + PHOENIX-2503, PHOENIX-2568. I'm
>>> using pyspark for now.
>>>
>>> I've added phoenix-$VERSION-client-spark.jar to both
>>> spark.executor.extraClassPath and spark.driver.extraClassPath. This allows
>>> me to use sqlContext.read to define a DataFrame against a Phoenix table.
>>> This appears to basically work, as I see PhoenixInputFormat in the logs and
>>> df.printSchema() shows me what I expect. However, when I try df.take(5), I
>>> get "IllegalStateException: unread block data" [1] from the workers. Poking
>>> around, this is commonly a problem with classpath. Any ideas as to
>>> specifically which jars are needed? Or better still, how to debug this
>>> issue myself. Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath
>>> gives me a VerifyError about netty method version mismatch. Indeed I see
>>> two netty versions in that lib directory...
>>>
>>> Thanks a lot,
>>> -n
>>>
>>> [0]: http://phoenix.apache.org/phoenix_spark.html
>>> [1]:
>>>
>>> java.lang.IllegalStateException: unread block data
>>> at
>>> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>> at
>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>> at
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>> at
>>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>> at
>>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>>
>>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <ja...@apache.org>
>>> wrote:
>>>
>>>> Thanks for remembering about the docs, Josh.
>>>>
>>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <jm...@gmail.com>
>>>> wrote:
>>>>
>>>>> Just an update for anyone interested, PHOENIX-2503 was just committed
>>>>> for 4.7.0 and the docs have been updated to include these samples for
>>>>> PySpark users.
>>>>>
>>>>> https://phoenix.apache.org/phoenix_spark.html
>>>>>
>>>>> Josh
>>>>>
>>>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <jm...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hey Nick,
>>>>>>
>>>>>> I think this used to work, and will again once PHOENIX-2503 gets
>>>>>> resolved. With the Spark DataFrame support, all the necessary glue is there
>>>>>> for Phoenix and pyspark to play nice. With that client JAR (or by
>>>>>> overriding the com.fasterxml.jackson JARS), you can do something like:
>>>>>>
>>>>>> df = sqlContext.read \
>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>   .option("table", "TABLE1") \
>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>   .load()
>>>>>>
>>>>>> And
>>>>>>
>>>>>> df.write \
>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>   .mode("overwrite") \
>>>>>>   .option("table", "TABLE1") \
>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>   .save()
>>>>>>
>>>>>>
>>>>>> Yes, this should be added to the documentation. I hadn't actually
>>>>>> tried this till just now. :)
>>>>>>
>>>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <nd...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Heya,
>>>>>>>
>>>>>>> Has anyone any experience using phoenix-spark integration from
>>>>>>> pyspark instead of scala? Folks prefer python around here...
>>>>>>>
>>>>>>> I did find this example [0] of using HBaseOutputFormat from pyspark,
>>>>>>> haven't tried extending it for phoenix. Maybe someone with more experience
>>>>>>> in pyspark knows better? Would be a great addition to our documentation.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Nick
>>>>>>>
>>>>>>> [0]:
>>>>>>> https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: phoenix-spark and pyspark

Posted by Nick Dimiduk <nd...@apache.org>.
On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <jm...@gmail.com> wrote:

> What version of Spark are you using?
>

Probably HDP's Spark 1.4.1; that's what the jars in my install say, and the
welcome message in the pyspark console agrees.

Are there any other traces of exceptions anywhere?
>

No other exceptions that I can find. YARN apparently doesn't want to
aggregate spark's logs.


> Are all your Spark nodes set up to point to the same phoenix-client-spark
> JAR?
>

Yes, as far as I can tell... though come to think of it, is that jar
shipped by the spark runtime to workers, or is it loaded locally on each
host? I only changed spark-defaults.conf on the client machine, the machine
from which I submitted the job.

Thanks for taking a look Josh!

Re: phoenix-spark and pyspark

Posted by Josh Mahonin <jm...@gmail.com>.
Hey Nick,

What version of Spark are you using?

I just tried pyspark with spark-1.5.2-bin-hadoop2.4 and the latest from
Phoenix master (probably the same phoenix-spark code as your version), and
was able to do a df.take(). Note that this was on one machine, using
phoenix_sandbox.py and a local spark-shell.
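
Roughly the session I ran, for comparison; the client JAR was on both the
driver and executor classpaths, and the table name and ZK quorum below are
placeholders for my local setup:

df = sqlContext.read \
  .format("org.apache.phoenix.spark") \
  .option("table", "TABLE1") \
  .option("zkUrl", "localhost:2181") \
  .load()

df.printSchema()  # schema is resolved on the driver via Phoenix metadata
df.take(5)        # launches an actual job, so it exercises the executor classpath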

I also tried it out using my fork of HDP Phoenix with basically those same
patches applied, and df.take() works for me from pyspark on a reasonable
dataset across a number of nodes.

Are there any other traces of exceptions anywhere? Are all your Spark nodes
set up to point to the same phoenix-client-spark JAR?

Josh
