Posted to dev@phoenix.apache.org by jmahonin <gi...@git.apache.org> on 2015/04/06 16:04:50 UTC

[GitHub] phoenix pull request: PHOENIX-1815 Use Spark Data Source API in phoenix-spark module

GitHub user jmahonin opened a pull request:

    https://github.com/apache/phoenix/pull/63

    PHOENIX-1815 Use Spark Data Source API in phoenix-spark module

    This allows using the SQLContext.load() functionality to create
    a Phoenix DataFrame, with support for column and predicate
    push-down filtering from Spark SQL.
    
    DataFrame.save() is also supported, for persisting DataFrames
    back to Phoenix.
    
    This may work with Spark's standalone SQL server mode, but it
    hasn't been tested.
    
    ref:
    https://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources
    
    https://databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html
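
    For illustration, a minimal usage sketch. It assumes the data source
    is registered as "org.apache.phoenix.spark" and accepts "table" and
    "zkUrl" options; the table, column names and ZK quorum below are
    hypothetical.
    
        // Spark 1.3-style DataFrame load/save against Phoenix.
        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.sql.{SQLContext, SaveMode}
        
        val sc = new SparkContext(new SparkConf().setAppName("phoenix-spark-demo"))
        val sqlContext = new SQLContext(sc)
        
        // Load a Phoenix table as a DataFrame; column and predicate
        // filters issued through Spark SQL are pushed down to Phoenix.
        val df = sqlContext.load("org.apache.phoenix.spark",
          Map("table" -> "TABLE1", "zkUrl" -> "localhost:2181"))
        df.filter(df("COL1") === "test_row_1").select("ID").show()
        
        // Persist a DataFrame back to Phoenix.
        df.save("org.apache.phoenix.spark", SaveMode.Overwrite,
          Map("table" -> "OUTPUT_TABLE", "zkUrl" -> "localhost:2181"))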

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/FileTrek/phoenix PHOENIX-1815

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/phoenix/pull/63.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #63
    
----
commit c62178a1016a7885bd2c082fd1380c9a3023ca34
Author: Josh Mahonin <jm...@gmail.com>
Date:   2015-03-25T19:40:10Z

    PHOENIX-1815 Use Spark Data Source API in phoenix-spark module
    
    This allows using the SQLContext.load() functionality to create
    a Phoenix DataFrame, with support for column and predicate
    push-down filtering from Spark SQL.
    
    DataFrame.save() is also supported, for persisting DataFrames
    back to Phoenix.
    
    This may work with Spark's standalone SQL server mode, but it
    hasn't been tested.
    
    ref:
    https://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources
    
    https://databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html

----



[GitHub] phoenix pull request: PHOENIX-1815 Use Spark Data Source API in phoenix-spark module

Posted by mravi <gi...@git.apache.org>.
Github user mravi commented on the pull request:

    https://github.com/apache/phoenix/pull/63#issuecomment-92298667
  
    @jmahonin 
    Would it be good to have ProductRDDFunctions renamed to PhoenixRDDFunctions? I believe you had a good reason to choose the original name, though. Thoughts?
    




[GitHub] phoenix pull request: PHOENIX-1815 Use Spark Data Source API in phoenix-spark module

Posted by jmahonin <gi...@git.apache.org>.
Github user jmahonin closed the pull request at:

    https://github.com/apache/phoenix/pull/63



[GitHub] phoenix pull request: PHOENIX-1815 Use Spark Data Source API in phoenix-spark module

Posted by jmahonin <gi...@git.apache.org>.
Github user jmahonin commented on the pull request:

    https://github.com/apache/phoenix/pull/63#issuecomment-92903008
  
    Re: step c, I would lean towards not including either the Spark or Scala library JARs. They are provided by the Spark runtime itself, so I'm not sure it makes sense to bundle them within the Phoenix assembly JAR. Does that make sense to you?
    
    ref:
    https://spark.apache.org/docs/latest/submitting-applications.html
    https://github.com/sbt/sbt-assembly#-provided-configuration
    
    After those, I think the only other runtime dependency not already in the all-common-jars file is snappy-java, which I'm not sure is explicitly needed anymore now that we're part of a multi-module build. I will double-check.
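
    For reference, the sbt-assembly pattern linked above looks roughly
    like this in a downstream build (version numbers are illustrative):
    
        // build.sbt sketch: Spark is marked "provided", so sbt-assembly
        // leaves it out of the application fat JAR and the Spark runtime
        // supplies it (sbt-assembly can likewise exclude the Scala library).
        libraryDependencies ++= Seq(
          "org.apache.spark" %% "spark-core" % "1.3.0" % "provided",
          "org.apache.spark" %% "spark-sql"  % "1.3.0" % "provided"
        )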



[GitHub] phoenix pull request: PHOENIX-1815 Use Spark Data Source API in phoenix-spark module

Posted by mravi <gi...@git.apache.org>.
Github user mravi commented on the pull request:

    https://github.com/apache/phoenix/pull/63#issuecomment-90094412
  
    Thanks @jmahonin for the quick turnaround on this.
    I will review it today and get back to you.



[GitHub] phoenix pull request: PHOENIX-1815 Use Spark Data Source API in phoenix-spark module

Posted by jmahonin <gi...@git.apache.org>.
Github user jmahonin commented on the pull request:

    https://github.com/apache/phoenix/pull/63#issuecomment-91239667
  
    Merged in 'master' to update with new integration tests



[GitHub] phoenix pull request: PHOENIX-1815 Use Spark Data Source API in phoenix-spark module

Posted by jmahonin <gi...@git.apache.org>.
Github user jmahonin commented on the pull request:

    https://github.com/apache/phoenix/pull/63#issuecomment-92981336
  
    I misread your review at first and thought you had asked to update all-common-dependencies rather than all-common-jars.
    
    a), b) and c) are all addressed in the latest commit here, and the final patch files are attached to the JIRA ticket.
    
    Thanks!



[GitHub] phoenix pull request: PHOENIX-1815 Use Spark Data Source API in phoenix-spark module

Posted by mravi <gi...@git.apache.org>.
Github user mravi commented on the pull request:

    https://github.com/apache/phoenix/pull/63#issuecomment-92636784
  
    Thanks @jmahonin for making the necessary changes. Below is the final set of tasks that need to be done before we publish documentation; a sketch for a) and b) follows the list.
    
    a) In phoenix/pom.xml, add the phoenix-spark module as a dependency under <dependencyManagement>.
    https://github.com/apache/phoenix/blob/master/pom.xml#L427
    b) In phoenix/phoenix-assembly/pom.xml, add the phoenix-spark module as a dependency.
    https://github.com/apache/phoenix/blob/master/phoenix-assembly/pom.xml#L148
    c) In phoenix/phoenix-assembly/src/build/components/all-common-jars.xml, register the Spark artifacts.
    https://github.com/apache/phoenix/blob/master/phoenix-assembly/src/build/components/all-common-jars.xml#L73
    d) Generate HTML from the README.
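    
    For a) and b), the POM entries would look roughly like this (a sketch only; managing the version in a) lets the phoenix-assembly entry omit it):
    
        <!-- a) phoenix/pom.xml, under <dependencyManagement> -->
        <dependency>
          <groupId>org.apache.phoenix</groupId>
          <artifactId>phoenix-spark</artifactId>
          <version>${project.version}</version>
        </dependency>
        
        <!-- b) phoenix-assembly/pom.xml; the version is inherited from a) -->
        <dependency>
          <groupId>org.apache.phoenix</groupId>
          <artifactId>phoenix-spark</artifactId>
        </dependency>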
    
    Is it possible for you to create patches for both the master and the 4.4.x HBase 0.98 branches and attach them to the JIRA? I can take care of task d) and push to SVN.
    
    Thanks!!




[GitHub] phoenix pull request: PHOENIX-1815 Use Spark Data Source API in phoenix-spark module

Posted by jmahonin <gi...@git.apache.org>.
Github user jmahonin commented on the pull request:

    https://github.com/apache/phoenix/pull/63#issuecomment-92343659
  
    Thanks for the review @mravi 
    
    That HBaseConfiguration.create() step is a great idea; I'll make that change ASAP.
    
    Re: the naming scheme, I'd attempted to follow the Spark-Cassandra connector, since there's not much reference code available yet, and the feature sets are relatively closely aligned:
    https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector
    
    Although I'm not completely married to the idea, both DataStax (Cassandra) and Databricks (Spark) seem to follow a _Functions.scala scheme, where _ is the class to which implicit helper methods are being attached. In this case, the new 'ProductRDDFunctions' attaches the implicit helper function 'saveToPhoenix' to objects of type RDD[Product], i.e. an RDD of tuples; a sketch of the pattern follows below.
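    
    To make that concrete, an illustrative sketch of the pattern (names
    and signatures are simplified, not the exact phoenix-spark code):
    
        import org.apache.spark.rdd.RDD
        
        object PhoenixImplicits {
          // Enrichment class: pins saveToPhoenix onto any RDD of tuples
          // or case classes, i.e. RDD[Product].
          implicit class ProductRDDFunctions[A <: Product](val rdd: RDD[A]) {
            def saveToPhoenix(tableName: String, columns: Seq[String]): Unit = {
              // Map each Product's fields onto the given Phoenix columns
              // and upsert them per partition; the real write is elided.
              rdd.foreachPartition { rows =>
                // open a Phoenix connection, upsert `rows`, then close it
                ()
              }
            }
          }
        }
        
        // Usage, once the implicits are in scope:
        //   import PhoenixImplicits._
        //   sc.parallelize(Seq((1L, "foo"))).saveToPhoenix("OUTPUT_TABLE", Seq("ID", "COL1"))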




[GitHub] phoenix pull request: PHOENIX-1815 Use Spark Data Source API in phoenix-spark module

Posted by mravi <gi...@git.apache.org>.
Github user mravi commented on the pull request:

    https://github.com/apache/phoenix/pull/63#issuecomment-92292084
  
    @jmahonin 
    One minor change: can you please replace the following line in PhoenixRDD.scala and ProductRDDFunctions? This will ensure we load hbase-site.xml, so that all configuration parameters set for HBase / Phoenix are applied.
    
        val config = new Configuration(conf)
    
    with
    
        val config = HBaseConfiguration.create(conf)
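    
    For context, a minimal sketch of the difference, assuming `conf` is
    the Hadoop Configuration already in scope:
    
        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.hbase.HBaseConfiguration
        
        // new Configuration(conf) only copies the entries already set
        // on `conf`.
        val copied: Configuration = new Configuration(conf)
        
        // HBaseConfiguration.create(conf) additionally layers in
        // hbase-default.xml and hbase-site.xml from the classpath, so
        // cluster-wide HBase / Phoenix settings take effect.
        val config: Configuration = HBaseConfiguration.create(conf)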


