Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2015/02/04 03:06:18 UTC

[GitHub] spark pull request: [WIP] [SPARK-5577] Python udf for DataFrame

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/4351

    [WIP] [SPARK-5577] Python udf for DataFrame

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark python_udf

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4351.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4351
    
----
commit 3ab26614b5278edce6e8571e5c51fe0b67e3124e
Author: Davies Liu <da...@databricks.com>
Date:   2015-02-03T08:08:00Z

    add more tests for DataFrame

commit 6040ba73431cc22d8d777555db6b35241275bdce
Author: Davies Liu <da...@databricks.com>
Date:   2015-02-03T09:09:36Z

    fix docs

commit 9ab78b4262961deafe0256c8c28d2911a4c07b0a
Author: Davies Liu <da...@databricks.com>
Date:   2015-02-03T09:10:54Z

    Merge branch 'master' of github.com:apache/spark into fix_df
    
    Conflicts:
    	sql/core/src/main/scala/org/apache/spark/sql/Column.scala

commit 78ebcfa6ba750e081f6b5c7b07c8d04f32c2d4d6
Author: Davies Liu <da...@databricks.com>
Date:   2015-02-03T09:12:02Z

    add sql_test.py in run_tests

commit 35ccb9f5721266a3a25df7e5f6d4b2c98f5f18d5
Author: Davies Liu <da...@databricks.com>
Date:   2015-02-03T09:23:16Z

    fix build

commit 8dd19a912e8595dddeec56fea964ab40b5b9f738
Author: Davies Liu <da...@databricks.com>
Date:   2015-02-03T18:00:04Z

    fix tests in python 2.6

commit c052f6fe0aaaf688a8f08e0fe04abdeea8933448
Author: Davies Liu <da...@databricks.com>
Date:   2015-02-03T18:44:36Z

    Merge branch 'master' of github.com:apache/spark into fix_df

commit 83c92fedc4f69dfff909d61899c906cea357498f
Author: Davies Liu <da...@databricks.com>
Date:   2015-02-03T20:21:08Z

    address comments

commit 467332cacca8754f04271a70bbaf15c8f2afd5c6
Author: Davies Liu <da...@databricks.com>
Date:   2015-02-03T20:34:16Z

    support string in cast()

commit dd9919f115d3b8f4b66d213c4a57bc832ed8ed57
Author: Davies Liu <da...@databricks.com>
Date:   2015-02-03T22:17:09Z

    fix tests

commit 1e4766485b20629a9cee12fc1c4751fc427cc569
Author: Davies Liu <da...@databricks.com>
Date:   2015-02-04T01:24:15Z

    Merge branch 'master' of github.com:apache/spark into python_udf

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72798853
  
      [Test build #26727 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26727/consoleFull) for   PR 4351 at commit [`7bccc3b`](https://github.com/apache/spark/commit/7bccc3bb7d1e829a3f8d91f508e12899d225235e).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72807954
  
      [Test build #26727 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26727/consoleFull) for   PR 4351 at commit [`7bccc3b`](https://github.com/apache/spark/commit/7bccc3bb7d1e829a3f8d91f508e12899d225235e).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72931323
  
      [Test build #26770 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26770/consoleFull) for   PR 4351 at commit [`34234d4`](https://github.com/apache/spark/commit/34234d4733c37a6e47aee1712769a6b7503ca80b).
     * This patch merges cleanly.



[GitHub] spark pull request: [WIP] [SPARK-5577] Python udf for DataFrame

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72786434
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26705/
    Test PASSed.



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4351#discussion_r24068994
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dsl.scala ---
    @@ -177,6 +181,23 @@ object Dsl {
         cols.toList.toSeq
       }
     
    +  /**
    +   * This is a private API for Python
    +   * TODO: move this to a private package
    +   */
    +  def pythonUDF(
    --- End diff --
    
    we already have two functions here - why don't we take this chance to move both into a separate class? 
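
For readers following the thread, here is a minimal sketch of the general idea behind a Python UDF for a DataFrame: a plain function is wrapped, and calling the wrapper with column names records a deferred call that is later evaluated once per row. This is not the code from the pull request. It is a pure-Python stand-in that does not use Spark, and the names `ToyUserDefinedFunction`, `ToyUdfExpression`, and `select` are hypothetical.

```python
# Toy stand-in, NOT the PR's implementation: every name below is hypothetical.
class ToyUserDefinedFunction(object):
    """Wrap a plain Python function together with a declared return type name."""

    def __init__(self, func, return_type):
        self.func = func
        self.return_type = return_type  # only a label in this sketch

    def __call__(self, *column_names):
        # Calling the wrapper does not run the function; it records which
        # columns the function should later be applied to.
        return ToyUdfExpression(self, column_names)


class ToyUdfExpression(object):
    """A deferred call: a wrapped function plus the columns it reads."""

    def __init__(self, udf, column_names):
        self.udf = udf
        self.column_names = column_names

    def evaluate(self, row):
        # `row` is a dict mapping column name to value.
        args = [row[name] for name in self.column_names]
        return self.udf.func(*args)


def select(rows, expression):
    """Evaluate a deferred expression once per row."""
    return [expression.evaluate(row) for row in rows]


rows = [{"name": "Alice", "age": 2}, {"name": "Bob", "age": 5}]
plus_one = ToyUserDefinedFunction(lambda x: x + 1, "integer")
result = select(rows, plus_one("age"))
print(result)
```

In the real feature the function has to be shipped to Python workers and wired into the JVM query plan, which is presumably where a private helper such as `pythonUDF` comes in; the sketch leaves all of that out.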



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72946539
  
      [Test build #26770 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26770/consoleFull) for   PR 4351 at commit [`34234d4`](https://github.com/apache/spark/commit/34234d4733c37a6e47aee1712769a6b7503ca80b).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  class DirectKafkaInputDStreamCheckpointData extends DStreamCheckpointData(this) `
      * `class KafkaCluster(val kafkaParams: Map[String, String]) extends Serializable `
      * `  case class LeaderOffset(host: String, port: Int, offset: Long)`
      * `class KafkaRDDPartition(`
      * `trait HasOffsetRanges `
      * `class UserDefinedFunction(object):`




[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72925964
  
      [Test build #26766 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26766/consoleFull) for   PR 4351 at commit [`440f769`](https://github.com/apache/spark/commit/440f76922bba64b701c1cab2f762e6811d0a558e).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class UserDefinedFunction(object):`




[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72964397
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26778/
    Test PASSed.



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72906625
  
      [Test build #582 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/582/consoleFull) for   PR 4351 at commit [`f0a3121`](https://github.com/apache/spark/commit/f0a31217ed7d837dc98ff974f8417bb456fd49af).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72923342
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26765/
    Test PASSed.



[GitHub] spark pull request: [WIP] [SPARK-5577] Python udf for DataFrame

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72786425
  
      [Test build #26705 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26705/consoleFull) for   PR 4351 at commit [`c6d0d59`](https://github.com/apache/spark/commit/c6d0d592738c7bb459852d60287925d8f0a30a4b).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class FPGrowthModel[Item: ClassTag](`
      * `class Dsl(object):`
      * `class ExamplePointUDT(UserDefinedType):`
      * `class SQLTests(ReusedPySparkTestCase):`




[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72812507
  
      [Test build #26740 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26740/consoleFull) for   PR 4351 at commit [`f99b2e1`](https://github.com/apache/spark/commit/f99b2e12ccb03fa9e5803d7379535f7dc54dcab4).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72948424
  
      [Test build #26778 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26778/consoleFull) for   PR 4351 at commit [`d250692`](https://github.com/apache/spark/commit/d25069257d6a195f6d7c3b848bc32f9764a7f6b1).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72966402
  
    @rxin I think this is ready.



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/4351



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4351#discussion_r24065334
  
    --- Diff: python/pyspark/sql.py ---
    @@ -2263,18 +2263,6 @@ def subtract(self, other):
             """
             return DataFrame(getattr(self._jdf, "except")(other._jdf), self.sql_ctx)
     
    -    def sample(self, withReplacement, fraction, seed=None):
    --- End diff --
    
    Oops, I mistakenly deleted my own comment.



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72966714
  
    Thanks. Merged.



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72925979
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26766/
    Test PASSed.



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72832301
  
      [Test build #26740 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26740/consoleFull) for   PR 4351 at commit [`f99b2e1`](https://github.com/apache/spark/commit/f99b2e12ccb03fa9e5803d7379535f7dc54dcab4).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4351#discussion_r24068933
  
    --- Diff: python/pyspark/sql.py ---
    @@ -2077,9 +2077,9 @@ def dtypes(self):
             """Return all column names and their data types as a list.
     
             >>> df.dtypes
    -        [(u'age', 'IntegerType'), (u'name', 'StringType')]
    +        [('age', 'integer'), ('name', 'string')]
             """
    -        return [(f.name, str(f.dataType)) for f in self.schema().fields]
    +        return [(str(f.name), f.dataType.jsonValue()) for f in self.schema().fields]
    --- End diff --
    
    We should use simplestring (which isn't available yet); we can change it in the future.
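
To make the effect of the `dtypes` change in the diff above concrete, the following sketch contrasts the two list comprehensions on toy schema objects. It assumes nothing about pyspark itself: `ToyDataType`, `ToyField`, `json_value`, `dtypes_old`, and `dtypes_new` are hypothetical stand-ins that only mimic the shapes visible in the doctest.

```python
# Hypothetical stand-ins for schema objects; these are not pyspark classes.
class ToyDataType(object):
    def __init__(self, short_name):
        self.short_name = short_name

    def json_value(self):
        # Short lowercase name, like the 'integer' / 'string' values
        # shown in the updated doctest.
        return self.short_name

    def __str__(self):
        # Class-style name, like the old 'IntegerType' output.
        return self.short_name.capitalize() + "Type"


class ToyField(object):
    def __init__(self, name, data_type):
        self.name = name
        self.data_type = data_type


def dtypes_old(fields):
    # Shape of the removed line: field name paired with str() of the type.
    return [(f.name, str(f.data_type)) for f in fields]


def dtypes_new(fields):
    # Shape of the added line: plain-string name paired with the short type name.
    return [(str(f.name), f.data_type.json_value()) for f in fields]


fields = [ToyField(u"age", ToyDataType("integer")),
          ToyField(u"name", ToyDataType("string"))]
print(dtypes_old(fields))
print(dtypes_new(fields))
```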



[GitHub] spark pull request: [WIP] [SPARK-5577] Python udf for DataFrame

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4351#discussion_r24062243
  
    --- Diff: python/pyspark/sql.py ---
    @@ -2263,18 +2263,6 @@ def subtract(self, other):
             """
             return DataFrame(getattr(self._jdf, "except")(other._jdf), self.sql_ctx)
     
    -    def sample(self, withReplacement, fraction, seed=None):
    --- End diff --
    
    why are we removing sample?



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72809286
  
      [Test build #26737 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26737/consoleFull) for   PR 4351 at commit [`462b334`](https://github.com/apache/spark/commit/462b3341f38e14e69ac1974ba6f56e737f31f004).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72832313
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26740/
    Test FAILed.



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72923325
  
      [Test build #26765 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26765/consoleFull) for   PR 4351 at commit [`f0a3121`](https://github.com/apache/spark/commit/f0a31217ed7d837dc98ff974f8417bb456fd49af).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class UserDefinedFunction(object):`




[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72907051
  
      [Test build #26766 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26766/consoleFull) for   PR 4351 at commit [`440f769`](https://github.com/apache/spark/commit/440f76922bba64b701c1cab2f762e6811d0a558e).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72964394
  
      [Test build #26778 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26778/consoleFull) for   PR 4351 at commit [`d250692`](https://github.com/apache/spark/commit/d25069257d6a195f6d7c3b848bc32f9764a7f6b1).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class UserDefinedFunction(object):`




[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72925442
  
      [Test build #582 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/582/consoleFull) for   PR 4351 at commit [`f0a3121`](https://github.com/apache/spark/commit/f0a31217ed7d837dc98ff974f8417bb456fd49af).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class UserDefinedFunction(object):`




[GitHub] spark pull request: [WIP] [SPARK-5577] Python udf for DataFrame

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72777617
  
      [Test build #26705 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26705/consoleFull) for PR 4351 at commit [`c6d0d59`](https://github.com/apache/spark/commit/c6d0d592738c7bb459852d60287925d8f0a30a4b).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4351#discussion_r24068977
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Column.scala ---
    @@ -174,6 +173,22 @@ trait Column extends DataFrame {
       }
     
       /**
    +   * Inequality test.
    +   * {{{
    +   *   // Scala:
    +   *   df.select( df("colA") !== df("colB") )
    --- End diff --
    
    You probably want to update the Javadoc for !== as well.



[GitHub] spark pull request: [WIP] [SPARK-5577] Python udf for DataFrame

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4351#discussion_r24063741
  
    --- Diff: python/pyspark/sql.py ---
    @@ -2263,18 +2263,6 @@ def subtract(self, other):
             """
             return DataFrame(getattr(self._jdf, "except")(other._jdf), self.sql_ctx)
     
    -    def sample(self, withReplacement, fraction, seed=None):
    --- End diff --
    
    There were two definitions of sample(); this removes the duplicate.



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72829660
  
      [Test build #26737 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26737/consoleFull) for PR 4351 at commit [`462b334`](https://github.com/apache/spark/commit/462b3341f38e14e69ac1974ba6f56e737f31f004).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4351#discussion_r24068962
  
    --- Diff: python/pyspark/sql.py ---
    @@ -2593,6 +2553,45 @@ def _(col):
         return staticmethod(_)
     
     
    +class UserDefinedFunction(object):
    +    def __init__(self, func, returnType):
    +        self.func = func
    +        self.returnType = returnType
    +        self._judf = self._create_judf()
    +
    +    def _create_judf(self):
    --- End diff --
    
    Can you add some inline comments explaining what's happening in this function?



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72807964
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26727/



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72946550
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26770/



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4351#discussion_r24068970
  
    --- Diff: python/pyspark/sql.py ---
    @@ -2651,6 +2650,16 @@ def approxCountDistinct(col, rsd=None):
                 jc = sc._jvm.Dsl.approxCountDistinct(_to_java_column(col), rsd)
             return Column(jc)
     
    +    @staticmethod
    +    def udf(f, returnType=StringType()):
    +        """Create a user defined function (UDF)
    +
    +        >>> slen = Dsl.udf(lambda s: len(s), IntegerType())
    +        >>> df.select(slen(df.name).As('slen')).collect()
    --- End diff --
    
    Use alias instead of As in the docstring.
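
For readers without a Spark build at hand, the shape of the API in the diff above can be sketched in plain Python. This is only an illustration under stated assumptions: the class below mirrors the `func` and `returnType` constructor arguments from the diff, but it skips the JVM side (`_create_judf`) entirely, uses strings in place of `StringType()`/`IntegerType()`, and stands in for `select(...).alias('slen')` with a list comprehension over dicts. The `__call__` method and the sample `rows` are hypothetical.

```python
class UserDefinedFunction(object):
    """Plain-Python stand-in: pairs a function with a declared return type.

    Hypothetical sketch; the class in the pull request also builds a
    JVM-side UDF, which is not modelled here.
    """

    def __init__(self, func, returnType):
        self.func = func
        self.returnType = returnType  # a plain label here, e.g. "int"

    def __call__(self, value):
        # Apply the wrapped function to a single value.
        return self.func(value)


def udf(f, returnType="string"):
    """Wrap f as a UserDefinedFunction, defaulting to a string result."""
    return UserDefinedFunction(f, returnType)


slen = udf(lambda s: len(s), "int")

# Hypothetical sample data standing in for a DataFrame with a `name` column.
rows = [{"name": "Alice"}, {"name": "Bob"}]

# Analogue of selecting slen(name) under the alias 'slen':
# apply the function to each row and store the result under that name.
result = [{"slen": slen(row["name"])} for row in rows]
```

The point of the alias is only that the computed column gets a stable, readable name (`slen`) rather than one derived from the expression.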



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72904323
  
      [Test build #26765 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26765/consoleFull) for PR 4351 at commit [`f0a3121`](https://github.com/apache/spark/commit/f0a31217ed7d837dc98ff974f8417bb456fd49af).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-5577] Python udf for DataFrame

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72829667
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26737/



[GitHub] spark pull request: [WIP] [SPARK-5577] Python udf for DataFrame

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4351#issuecomment-72778536
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26703/

