Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2015/02/13 08:42:21 UTC

[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/4585

    [SPARK-5793][SQL] Add explode to Column

    Add explode function to Column.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 column_explode

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4585.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4585
    
----
commit 86f530a76d76034f822cf7c3cfe65d649ca1fcc1
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date:   2015-02-13T07:31:50Z

    Add explode function to Column.

commit 5011ccb604a74065f259ef8ca269398e649252cc
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date:   2015-02-13T07:40:46Z

    Add unit test.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-74216811
  
      [Test build #27426 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27426/consoleFull) for   PR 4585 at commit [`5011ccb`](https://github.com/apache/spark/commit/5011ccb604a74065f259ef8ca269398e649252cc).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4585#discussion_r24714073
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Column.scala ---
    @@ -576,6 +578,25 @@ trait Column extends DataFrame {
       override def as(alias: Symbol): Column = exprToColumn(Alias(expr, alias.name)())
     
       /**
    +   * (Scala-specific) Explodes the column to zero or more rows by the provided function.
    +   * {{{
    +   *   val df = Seq(Tuple1("a b c"), Tuple1("d e")).toDataFrame("words")
    +   *   val col = df("words")
    +   *   col.explode {words: String => words.split(" ")}
    +   * }}}
    +   */
    +  def explode[A, B : TypeTag](f: A => TraversableOnce[B]): Column = {
    +    val dataType = ScalaReflection.schemaFor[B].dataType
    +    val attributes = AttributeReference(schema.fields(0).name, dataType)() :: Nil
    +    def rowFunction(row: Row) = {
    +      f(row(0).asInstanceOf[A]).map(o => Row(ScalaReflection.convertToCatalyst(o, dataType)))
    +    }
    +    val generator = UserDefinedGenerator(attributes, rowFunction, expr :: Nil)
    +    val plan = Generate(generator, join = true, outer = false, None, logicalPlan)
    +    Column(sqlContext, Project(attributes, plan), attributes(0))
    --- End diff --
    
    Is it? Isn't the expression in the Project just Seq(attributes(0))?




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4585#discussion_r24719184
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Column.scala ---
    @@ -576,6 +578,25 @@ trait Column extends DataFrame {
       override def as(alias: Symbol): Column = exprToColumn(Alias(expr, alias.name)())
     
       /**
    +   * (Scala-specific) Explodes the column to zero or more rows by the provided function.
    +   * {{{
    +   *   val df = Seq(Tuple1("a b c"), Tuple1("d e")).toDataFrame("words")
    +   *   val col = df("words")
    +   *   col.explode {words: String => words.split(" ")}
    +   * }}}
    +   */
    +  def explode[A, B : TypeTag](f: A => TraversableOnce[B]): Column = {
    +    val dataType = ScalaReflection.schemaFor[B].dataType
    +    val attributes = AttributeReference(schema.fields(0).name, dataType)() :: Nil
    +    def rowFunction(row: Row) = {
    +      f(row(0).asInstanceOf[A]).map(o => Row(ScalaReflection.convertToCatalyst(o, dataType)))
    +    }
    +    val generator = UserDefinedGenerator(attributes, rowFunction, expr :: Nil)
    +    val plan = Generate(generator, join = true, outer = false, None, logicalPlan)
    +    Column(sqlContext, Project(attributes, plan), attributes(0))
    --- End diff --
    
    @rxin Are you sure that will cause a problem? It passes the tests. Any suggestions?




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4585#discussion_r24720118
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Column.scala ---
    @@ -576,6 +578,25 @@ trait Column extends DataFrame {
       override def as(alias: Symbol): Column = exprToColumn(Alias(expr, alias.name)())
     
       /**
    +   * (Scala-specific) Explodes the column to zero or more rows by the provided function.
    +   * {{{
    +   *   val df = Seq(Tuple1("a b c"), Tuple1("d e")).toDataFrame("words")
    +   *   val col = df("words")
    +   *   col.explode {words: String => words.split(" ")}
    +   * }}}
    +   */
    +  def explode[A, B : TypeTag](f: A => TraversableOnce[B]): Column = {
    +    val dataType = ScalaReflection.schemaFor[B].dataType
    +    val attributes = AttributeReference(schema.fields(0).name, dataType)() :: Nil
    +    def rowFunction(row: Row) = {
    +      f(row(0).asInstanceOf[A]).map(o => Row(ScalaReflection.convertToCatalyst(o, dataType)))
    +    }
    +    val generator = UserDefinedGenerator(attributes, rowFunction, expr :: Nil)
    +    val plan = Generate(generator, join = true, outer = false, None, logicalPlan)
    +    Column(sqlContext, Project(attributes, plan), attributes(0))
    --- End diff --
    
    What if you do

    col.explode(...) + 1

    and then collect on that? Does that work? (It might, but I'm not 100% sure.)
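
The semantics this question is probing can be sketched with plain Scala collections. This is only a model under stated assumptions: `ExplodeThenAdd` and its `explode` helper are hypothetical stand-ins that operate on an in-memory `Seq`, not Spark's `Column`, and the sketch says nothing about whether the logical plan built in the diff handles the `+ 1` case correctly.

```scala
object ExplodeThenAdd {
  // Hypothetical stand-in for Column.explode: each input value yields zero or more output rows.
  def explode[A, B](column: Seq[A])(f: A => Iterable[B]): Seq[B] =
    column.flatMap(f)

  def main(args: Array[String]): Unit = {
    val column = Seq("1 2 3", "4 5")
    // Explode each string into integers, one output row per token.
    val exploded = explode(column)(s => s.split(" ").toSeq.map(_.toInt))
    // The "+ 1" is then applied to every exploded row, not to the original rows.
    val plusOne = exploded.map(_ + 1)
    println(plusOne)
  }
}
```

In this model the result has one element per exploded row, so an expression built on top of an exploded column would be expected to see five rows here rather than two.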




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-85295470
  
    Agreed. If I want to try adding the explode function, which is better: adding it in this PR, or closing this one and doing it in a new PR?




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-74411009
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27512/
    Test PASSed.




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-74224200
  
      [Test build #27426 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27426/consoleFull) for   PR 4585 at commit [`5011ccb`](https://github.com/apache/spark/commit/5011ccb604a74065f259ef8ca269398e649252cc).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-74386002
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27490/
    Test PASSed.




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-74361382
  
      [Test build #27474 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27474/consoleFull) for   PR 4585 at commit [`72d053c`](https://github.com/apache/spark/commit/72d053c449e3ba61be08ca848ff03908b7890ffe).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-82269431
  
      [Test build #28717 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28717/consoleFull) for   PR 4585 at commit [`fa419d5`](https://github.com/apache/spark/commit/fa419d575586f9fc4f6978a87626ca0007542b4c).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-74418506
  
    @rxin any more suggestions?




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4585#discussion_r24712170
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Column.scala ---
    @@ -576,6 +578,27 @@ trait Column extends DataFrame {
       override def as(alias: Symbol): Column = exprToColumn(Alias(expr, alias.name)())
     
       /**
    +   * (Scala-specific) Explodes the column to zero or more rows by the provided function.
    +   * {{{
    +   *   val df = Seq(Tuple1("a b c"), Tuple1("d e")).toDataFrame("words")
    +   *   val col = df("words")
    +   *   col.explode("word"){words: String => words.split(" ")}
    +   * }}}
    +   */
    +  def explode[A, B : TypeTag](
    +      outputColumn: String)(
    --- End diff --
    
    Okay, that is better. It should not break overloading because its signature is different from DataFrame's.




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-74363431
  
      [Test build #27474 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27474/consoleFull) for   PR 4585 at commit [`72d053c`](https://github.com/apache/spark/commit/72d053c449e3ba61be08ca848ff03908b7890ffe).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-74363435
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27474/
    Test PASSed.




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-85311963
  
    Might be best to just close this one and submit a new PR. Thanks.




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-82268868
  
    @marmbrus @rxin Updated to master. Please review it when you have time. Thanks!




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-85175007
  
    To elaborate: if we add this, overloading means we can't also have a very simple method that just explodes repeated fields without any transformation, and we think that simpler method would be more useful.
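
As a rough illustration of the distinction being drawn, the sketch below contrasts the two shapes of explode using plain Scala collections. Everything in it is an assumption made for illustration: `FlattenSketch`, `explodeRepeated`, and `explodeWith` are hypothetical names, and the code models rows as an in-memory `Seq` rather than using any Spark API.

```scala
object FlattenSketch {
  // "Explode repeated fields without any transformation":
  // each row already holds a sequence, and every element becomes its own row.
  def explodeRepeated[A](rows: Seq[Seq[A]]): Seq[A] =
    rows.flatten

  // The function-taking shape proposed in this pull request:
  // a user-supplied function first turns each value into zero or more rows.
  def explodeWith[A, B](rows: Seq[A])(f: A => Iterable[B]): Seq[B] =
    rows.flatMap(f)

  def main(args: Array[String]): Unit = {
    val rows = Seq(Seq(1, 2), Seq.empty[Int], Seq(3))
    println(explodeRepeated(rows))
    // The simple form is the function-taking form with a pass-through function.
    println(explodeWith(rows)(xs => xs))
  }
}
```

In this model the simple form is just the function-taking form with a pass-through function, which is why both would naturally want the same method name.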




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by viirya <gi...@git.apache.org>.
Github user viirya closed the pull request at:

    https://github.com/apache/spark/pull/4585




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-74386000
  
      [Test build #27490 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27490/consoleFull) for   PR 4585 at commit [`a3d822c`](https://github.com/apache/spark/commit/a3d822c22052af7ed5824e891a71633353675f2f).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4585#discussion_r24703364
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Column.scala ---
    @@ -576,6 +578,27 @@ trait Column extends DataFrame {
       override def as(alias: Symbol): Column = exprToColumn(Alias(expr, alias.name)())
     
       /**
    +   * (Scala-specific) Explodes the column to zero or more rows by the provided function.
    +   * {{{
    +   *   val df = Seq(Tuple1("a b c"), Tuple1("d e")).toDataFrame("words")
    +   *   val col = df("words")
    +   *   col.explode("word"){words: String => words.split(" ")}
    +   * }}}
    +   */
    +  def explode[A, B : TypeTag](
    +      outputColumn: String)(
    --- End diff --
    
    This is equivalent to just explode(f).as(outputColumn), right? In that case, I wouldn't use multiline args. Just have a single function and let the user call `as` on it.
    
    (does that break overloading?)
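
The equivalence suggested here can be sketched with a toy model. `MiniColumn` below is a hypothetical stand-in, not Spark's `Column`, and the curried variant is deliberately given a different name (`explodeNamed`) so that the sketch does not take a position on the overloading question.

```scala
// Toy stand-in for a named column backed by an in-memory Seq; NOT Spark's Column class.
final case class MiniColumn[A](name: String, values: Seq[A]) {
  // Single-function explode: keeps the current column name.
  def explode[B](f: A => Iterable[B]): MiniColumn[B] =
    MiniColumn(name, values.flatMap(f))

  // Renames the column without touching its values.
  def as(alias: String): MiniColumn[A] =
    copy(name = alias)

  // Curried shape from the diff under review, expressed via explode and as.
  def explodeNamed[B](outputColumn: String)(f: A => Iterable[B]): MiniColumn[B] =
    explode(f).as(outputColumn)
}

object MiniColumnDemo {
  def main(args: Array[String]): Unit = {
    val words = MiniColumn("words", Seq("a b c", "d e"))
    val split = (w: String) => w.split(" ").toSeq
    // Both spellings produce the same named, exploded column in this model.
    println(words.explodeNamed("word")(split) == words.explode(split).as("word"))
  }
}
```

With only the single-function form, callers who want a different output name chain `as` themselves, which keeps the method surface smaller.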




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-79515875
  
    Hey, sorry this didn't make it in for Spark 1.3.  Assuming @rxin thinks this API is reasonable, would you mind updating to master?




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-84932929
  
    @marmbrus @rxin Can you review this to see if it is okay?




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-82313386
  
      [Test build #28717 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28717/consoleFull) for   PR 4585 at commit [`fa419d5`](https://github.com/apache/spark/commit/fa419d575586f9fc4f6978a87626ca0007542b4c).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-74411006
  
      [Test build #27512 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27512/consoleFull) for   PR 4585 at commit [`757c5a7`](https://github.com/apache/spark/commit/757c5a796ff4af4991405e4fb4a3cdd07faa22b7).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-74383258
  
      [Test build #27490 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27490/consoleFull) for   PR 4585 at commit [`a3d822c`](https://github.com/apache/spark/commit/a3d822c22052af7ed5824e891a71633353675f2f).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-74409066
  
      [Test build #27512 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27512/consoleFull) for   PR 4585 at commit [`757c5a7`](https://github.com/apache/spark/commit/757c5a796ff4af4991405e4fb4a3cdd07faa22b7).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-82210952
  
    Sure. 




[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-74224207
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27426/


[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-74380831
  
      [Test build #27489 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27489/consoleFull) for PR 4585 at commit [`b99ad8a`](https://github.com/apache/spark/commit/b99ad8a7ff024e1ce9b714e53d54d15e16032753).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-85313425
  
    Okay. Thanks.


[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-82313404
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28717/


[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4585#discussion_r24720745
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Column.scala ---
    @@ -576,6 +578,25 @@ trait Column extends DataFrame {
       override def as(alias: Symbol): Column = exprToColumn(Alias(expr, alias.name)())
     
       /**
    +   * (Scala-specific) Explodes the column to zero or more rows by the provided function.
    +   * {{{
    +   *   val df = Seq(Tuple1("a b c"), Tuple1("d e")).toDataFrame("words")
    +   *   val col = df("words")
    +   *   col.explode {words: String => words.split(" ")}
    +   * }}}
    +   */
    +  def explode[A, B : TypeTag](f: A => TraversableOnce[B]): Column = {
    +    val dataType = ScalaReflection.schemaFor[B].dataType
    +    val attributes = AttributeReference(schema.fields(0).name, dataType)() :: Nil
    +    def rowFunction(row: Row) = {
    +      f(row(0).asInstanceOf[A]).map(o => Row(ScalaReflection.convertToCatalyst(o, dataType)))
    +    }
    +    val generator = UserDefinedGenerator(attributes, rowFunction, expr :: Nil)
    +    val plan = Generate(generator, join = true, outer = false, None, logicalPlan)
    +    Column(sqlContext, Project(attributes, plan), attributes(0))
    --- End diff --
    
    I just tested it and it works. I also added a test for that.
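
For readers following the thread, the row-level behaviour described in the doc comment of the quoted diff ("zero or more rows by the provided function") can be sketched in plain Scala. This is only an illustration under stated assumptions: `ExplodeSketch` and its `explode` helper are hypothetical, a column is modelled as a plain `Seq` of values, and none of the Catalyst machinery from the patch (`UserDefinedGenerator`, `Generate`, `Project`) is involved.

```scala
object ExplodeSketch {
  // Hypothetical stand-in for Column.explode: the "column" is just a Seq of
  // values (one per row), and each value expands to zero or more output rows.
  def explode[A, B](column: Seq[A])(f: A => Seq[B]): Seq[B] =
    column.flatMap(f)

  def main(args: Array[String]): Unit = {
    // Same sample data as the doc comment in the diff.
    val words = Seq("a b c", "d e")
    val exploded = explode(words)(w => w.split(" ").toSeq)
    println(exploded) // List(a, b, c, d, e)
  }
}
```

The sketch takes `A => Seq[B]` rather than `A => TraversableOnce[B]` purely to keep the example portable across Scala versions; the signature in the patch is the one shown in the diff.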


[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4585#discussion_r24713046
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Column.scala ---
    @@ -576,6 +578,25 @@ trait Column extends DataFrame {
       override def as(alias: Symbol): Column = exprToColumn(Alias(expr, alias.name)())
     
       /**
    +   * (Scala-specific) Explodes the column to zero or more rows by the provided function.
    +   * {{{
    +   *   val df = Seq(Tuple1("a b c"), Tuple1("d e")).toDataFrame("words")
    +   *   val col = df("words")
    +   *   col.explode {words: String => words.split(" ")}
    +   * }}}
    +   */
    +  def explode[A, B : TypeTag](f: A => TraversableOnce[B]): Column = {
    +    val dataType = ScalaReflection.schemaFor[B].dataType
    +    val attributes = AttributeReference(schema.fields(0).name, dataType)() :: Nil
    +    def rowFunction(row: Row) = {
    +      f(row(0).asInstanceOf[A]).map(o => Row(ScalaReflection.convertToCatalyst(o, dataType)))
    +    }
    +    val generator = UserDefinedGenerator(attributes, rowFunction, expr :: Nil)
    +    val plan = Generate(generator, join = true, outer = false, None, logicalPlan)
    +    Column(sqlContext, Project(attributes, plan), attributes(0))
    --- End diff --
    
    I think this will break a lot of things because attributes(0) is different from the expression in the Project.


[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-74621954
  
    @rxin, is this PR ready to go?


[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-74381183
  
      [Test build #27489 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27489/consoleFull) for PR 4585 at commit [`b99ad8a`](https://github.com/apache/spark/commit/b99ad8a7ff024e1ce9b714e53d54d15e16032753).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-85174485
  
    I thought about this more -- I'm not sure how much this brings over the explode function in a DataFrame. On the other hand, I think it would be great to add to the expression functions an explode function that actually explodes arrays or structs. 
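
As a rough illustration of what exploding an array-typed column could mean at the row level, here is a plain Scala sketch. It is an assumption-laden example, not a proposed API: `Record`, `ArrayExplodeSketch`, and `explodeTags` are hypothetical names, and the choice to drop rows with empty arrays simply mirrors the `outer = false` setting in the quoted diff.

```scala
object ArrayExplodeSketch {
  // Hypothetical row type with one array-valued field.
  final case class Record(id: Int, tags: Seq[String])

  // One output row per array element, with the originating id kept alongside.
  // Rows whose array is empty produce no output rows.
  def explodeTags(rows: Seq[Record]): Seq[(Int, String)] =
    rows.flatMap(r => r.tags.map(tag => (r.id, tag)))

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Record(1, Seq("x", "y")),
      Record(2, Seq.empty),
      Record(3, Seq("z"))
    )
    explodeTags(rows).foreach(println)
    // (1,x)
    // (1,y)
    // (3,z)
  }
}
```

Unlike the function-based explode in the diff, no user function is needed here because the elements to emit are already stored in the row.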


[GitHub] spark pull request: [SPARK-5793][SQL] Add explode to Column

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4585#issuecomment-74381185
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27489/

