Posted to dev@phoenix.apache.org by xiaopeng-liao <gi...@git.apache.org> on 2016/08/25 12:20:50 UTC

[GitHub] phoenix pull request #196: [PHOENIX-2648] Add dynamic column support for spa...

GitHub user xiaopeng-liao opened a pull request:

    https://github.com/apache/phoenix/pull/196

    [PHOENIX-2648] Add dynamic column support for spark integration

    It supports both RDD and DataFrame read/write.
    Things needing consideration
    ======
    When saving a DataFrame, the Catalyst data types need to be converted to Phoenix types, e.g.
    StringType to VARCHAR, Array<Integer> to INTEGER_ARRAY, etc. The code is under phoenix-spark/src/main/scala/org.apache.phoenix.spark.DataFrameFunctions.scala
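    A minimal sketch of what that conversion looks like. Type names are represented as plain strings here so the example runs without a Spark dependency; the actual code in DataFrameFunctions.scala matches on `org.apache.spark.sql.types.DataType` instances, and the object and method names below are illustrative, not the real ones.
    ```scala
    // Hypothetical sketch of the Catalyst-to-Phoenix type mapping.
    object CatalystToPhoenix {
      def toPhoenixType(catalystType: String): String = catalystType match {
        case "StringType"             => "VARCHAR"
        case "LongType"               => "BIGINT"
        case "IntegerType"            => "INTEGER"
        case "DoubleType"             => "DOUBLE"
        case "ArrayType(IntegerType)" => "INTEGER_ARRAY"
        case other                    => sys.error(s"Unsupported Catalyst type: $other")
      }
    }
    ```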
    
    Usages
    =======
    - **RDD**
    
    **Save**
    ```
    val dataSet = List((1L, "1", 1, 1), (2L, "2", 2, 2), (3L, "3", 3, 3))
    sc
      .parallelize(dataSet)
      .saveToPhoenix(
        "OUTPUT_TEST_TABLE",
        Seq("ID", "COL1", "COL2", "COL4<INTEGER"),
        hbaseConfiguration
    )
    ```
    
    **Read**
    ```
        val columnNames = Seq("ID", "COL1", "COL2", "COL5<INTEGER")
        // Load the results back
        val loaded = sc.phoenixTableAsRDD(
          "OUTPUT_TEST_TABLE",columnNames,
          conf = hbaseConfiguration
        )
    ```
    
    - **Dataframe**
    
    **Save**
    Data types are taken from the DataFrame schema and converted to Phoenix-supported types
    ```
    val dataSet = List((1L, "1", 1, 1,"2"), (2L, "2", 2, 2,"3"), (3L, "3", 3, 3,"4"))
    sc
      .parallelize(dataSet).toDF("ID","COL1","COL2","COL6","COL7")
      .saveToPhoenix("OUTPUT_TEST_TABLE",zkUrl = Some(quorumAddress))
    ```
    
    **Read**
    ```
    val df1 = sqlContext.phoenixTableAsDataFrame("OUTPUT_TEST_TABLE", Array("ID", 
        "COL1","COL6<INTEGER", "COL7<VARCHAR"), conf = hbaseConfiguration)
    ```
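    The `NAME<TYPE` column specs in the examples above split into a column name and an optional dynamic-column type. A self-contained sketch of that parsing, assuming the simple `name<type` grammar shown in the examples (the helper name is hypothetical, not the patch's actual code):
    ```scala
    // Hypothetical helper: split a "NAME<TYPE" spec into (name, Option(type)).
    // Plain columns like "ID" carry no dynamic type.
    def parseColumnSpec(spec: String): (String, Option[String]) =
      spec.split('<') match {
        case Array(name)        => (name, None)
        case Array(name, dtype) => (name, Some(dtype))
        case _                  => sys.error(s"Malformed column spec: $spec")
      }
    ```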


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xiaopeng-liao/phoenix phoenix-addsparkdynamic

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/phoenix/pull/196.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #196
    
----
commit a2dc6101d96333f781ff9e905c47c035f8b89462
Author: xiaopeng liao <xiaopeng liao>
Date:   2016-08-17T12:13:58Z

    add dynamic column support for SPARK rdd

commit 6969287db5ea341bc3876af55f7d0ef3acb035c2
Author: xiaopeng liao <xiaopeng liao>
Date:   2016-08-18T09:46:38Z

    add dynamic column support for reading from PhoenixRDD.

commit 5688b6c90c66b02cc22fcac6e67b9712d7eb660e
Author: xiaopeng-liao <xp...@gmail.com>
Date:   2016-08-19T14:52:27Z

    Merge pull request #1 from apache/master
    
    merge in latest changes from phoenix

commit a9b217e55393f613e9ca168faccd93e7626c7324
Author: xiaopeng liao <xiaopeng liao>
Date:   2016-08-23T10:51:34Z

    [PHOENIX-2648] add support for dynamic columns for RDD and Dataframe

commit 51190865375397581cbd1d6b960c79be7d727b97
Author: xiaopeng liao <xiaopeng liao>
Date:   2016-08-23T10:52:27Z

    Merge branch 'phoenix-addsparkdynamic' of https://github.com/xiaopeng-liao/phoenix into phoenix-addsparkdynamic

commit 6cbd6314782a6eb1a4c69eae25371791e4d64f90
Author: xiaopeng liao <xiaopeng liao>
Date:   2016-08-23T13:00:55Z

    Remove the configuration for enable dynamic column as it is not used anyway

commit 8602554c875229f376499c082894cc33999f3e7b
Author: xiaopeng liao <xiaopeng liao>
Date:   2016-08-23T15:01:29Z

    More clean up, remove the configuration for dynamic column

commit d3a4f1575f4b376df32f6d28aeba14270ce58088
Author: xiaopeng liao <xiaopeng liao>
Date:   2016-08-25T08:44:47Z

    [PHOENIX-2648] change dynamic column format from COL:DataType to COL<DataType because it conflicts with index syntax

----

