Posted to reviews@spark.apache.org by ahirreddy <gi...@git.apache.org> on 2014/04/09 04:22:58 UTC

[GitHub] spark pull request: PySpark API for SparkSQL

GitHub user ahirreddy opened a pull request:

    https://github.com/apache/spark/pull/363

    PySpark API for SparkSQL

    An initial API that exposes SparkSQL functionality in PySpark. A PythonRDD composed of dictionaries with string keys and primitive values (boolean, float, int, long, string) can be converted into a SchemaRDD that supports SQL queries.
    
    ```
    from pyspark.context import SQLContext
    sqlCtx = SQLContext(sc)
    rdd = sc.parallelize([{"field1" : 1, "field2" : "row1"}, {"field1" : 2, "field2": "row2"}, {"field1" : 3, "field2": "row3"}])
    srdd = sqlCtx.applySchema(rdd)
    sqlCtx.registerRDDAsTable(srdd, "table1")
    srdd2 = sqlCtx.sql("SELECT field1 AS f1, field2 as f2 from table1")
    srdd2.collect()
    ```
    The last line yields ```[{"f1" : 1, "f2" : "row1"}, {"f1" : 2, "f2": "row2"}, {"f1" : 3, "f2": "row3"}]```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ahirreddy/spark pysql

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/363.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #363
    
----
commit b4bc82d2072e0ddb2204a04404d7afa2b2263aa9
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-06T22:00:47Z

    compiling

commit b6f4feb3c4917f463d2f54647dd2781a20fc63bc
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-06T22:03:59Z

    Java to python

commit 5cb8dc05a03a74f37f6cdaff165b3f1d1a94c1db
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-06T23:07:10Z

    java to python, and python to java

commit d2c60af513afca5aec0292316c9c0516de66927f
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-07T04:41:09Z

    Added schema rdd class

commit 949071bfd269f0ac608bfa470474c91cae97f91f
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-07T22:45:55Z

    doesn't crash

commit 9cb15c858dbacfe6156c7289575d0d1baa5a986c
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-07T23:09:22Z

    working

commit 730803e0843a3497d4bdf663a86363b33a8883c2
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-07T23:47:36Z

    more working

commit 837bd13bfa2e757ca6cdbe79af1ae00cba7749f0
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-08T00:16:57Z

    even better

commit 224add86bf0ca3af5c478d8189103463f2ed9918
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-08T01:26:48Z

    yippie

commit f16524d873d5b7e1f881d1d2bab66a88f9193bd7
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-08T04:25:14Z

    Switched to using Scala SQLContext

commit d69594dca922f87aa4ac05c4ab0b59a47eb12e5b
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-08T05:11:23Z

    returning dictionaries works

commit 337ed16ea5d30fc9e51415607cde2f24219c5624
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-08T05:17:48Z

    output dictionaries correctly

commit ed9e3b447f0114e8bbe02166fb61d7965a4eb641
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-08T05:25:26Z

    return row objects

commit 2d44498d9932821437fc3c0794eafe357591d86d
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-08T05:42:19Z

    awesome row objects

commit 1f6e3436291572bbcda267179b320daa939e7b8e
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-08T06:13:01Z

    SchemaRDD now has all RDD operations

commit ef91795554afd59c5fefa61721df62354094b92d
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-08T06:19:05Z

    made jrdd explicitly lazy

commit ec5b6e63782f3d181546bcd00bfb05e039a52b1d
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-08T06:33:52Z

    for now only allow dictionaries as input

commit 6c690e590e214c6f4e4e7f28eaedb30874df5ec6
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-08T06:36:53Z

    added todo explaining cost of creating Row object in python

commit 7e270b49a042e3e5f98ac030f3371eac838a716a
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-08T19:01:06Z

    adding tests

commit 90ab8f5365df1a1db98dbf7f7ae00f7c1ae4fa6f
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-08T19:32:12Z

    added test

commit 6417b7cbd99b710fa7b25d6858f843b3582e95c2
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-08T19:52:40Z

    added more tests :)

commit be5734e3ae0ff589bf5a25c02fbc819eaf0c0a1e
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-08T20:20:08Z

    added more tests

commit 22413b350e14d5d5103be3d8dce07660f86283fd
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-08T22:29:55Z

    Added pyrolite dependency

commit 3e874c6fca86e0c407b88d1cd861b1f1ba1fe685
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-08T22:43:32Z

    Added tests and documentation

commit 068ff77e84b5f8d72d32a1969f8f116d3cbd9f09
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-08T22:48:30Z

    doctest formatting

commit 052b4b70a2909cbb0b6fc1e2c61d066765fccc11
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-08T22:53:04Z

    cleaning up cruft

commit 08580e1f1001cc361a63ac9122ac4cb86f0abaff
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-09T01:01:31Z

    HiveContexts

commit 83a0cc6c9690c6bbd133d2e9e2b284c6d03ab0da
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-04-09T02:08:36Z

    Added Long, Double and Boolean as usable types + unit test

----



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40251818
  
    Merged build finished. 



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40289932
  
     Merged build triggered. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40169165
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14044/



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40291397
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14080/



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40432158
  
    Merged build finished. 



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40444369
  
    Merged build started. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11425946
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -284,6 +286,42 @@ private[spark] object PythonRDD {
         file.close()
       }
     
    +  def pythonToJava(pyRDD: JavaRDD[Array[Byte]]): JavaRDD[_] = {
    +    pyRDD.rdd.mapPartitions { iter =>
    +      val unpickle = new Unpickler
    +      // TODO: Figure out why flatMap is necessary for pyspark
    +      iter.flatMap { row =>
    +        unpickle.loads(row) match {
    +          case objs: java.util.ArrayList[Any] => objs
    --- End diff --
    
    I'd use an existential type (`java.util.ArrayList[_]`) here to avoid the compiler warning.



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40440628
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14130/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: PySpark API for SparkSQL

Posted by ahirreddy <gi...@git.apache.org>.
Github user ahirreddy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11514015
  
    --- Diff: python/pyspark/java_gateway.py ---
    @@ -64,5 +64,9 @@ def run(self):
         java_import(gateway.jvm, "org.apache.spark.api.java.*")
         java_import(gateway.jvm, "org.apache.spark.api.python.*")
         java_import(gateway.jvm, "org.apache.spark.mllib.api.python.*")
    +    java_import(gateway.jvm, "org.apache.spark.sql.SQLContext")
    +    java_import(gateway.jvm, "org.apache.spark.sql.hive.HiveContext")
    +    java_import(gateway.jvm, "org.apache.spark.sql.hive.LocalHiveContext")
    +    java_import(gateway.jvm, "org.apache.spark.sql.hive.TestHiveContext")
    --- End diff --
    
    I added a better message that tells the user they need to compile Spark with Hive to use the HiveContext.
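
    For illustration, a minimal sketch of such a guard; the `Py4JError` handling, the helper name `_get_hive_ctx`, and the message text are assumptions, not the PR's exact code:

    ```
    from py4j.protocol import Py4JError

    class SQLContext:
        def _get_hive_ctx(self):
            # Hypothetical helper: reach the JVM-side HiveContext and fail
            # with a readable message when Spark was built without Hive.
            try:
                return self._jvm.HiveContext(self._jsc.sc())
            except Py4JError:
                raise Exception("You must build Spark with Hive support "
                                "to use the HiveContext.")
    ```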



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40162084
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40292064
  
    Merged build finished. 



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40432650
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40451813
  
    I've merged this. Thanks @ahirreddy - cool stuff!



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11554360
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -1387,6 +1387,95 @@ def _jrdd(self):
         def _is_pipelinable(self):
             return not (self.is_cached or self.is_checkpointed)
     
    +class Row(dict):
    +    """
    +    An extended L{dict} that takes a L{dict} in its constructor, and exposes those items as fields.
    +
    +    >>> r = Row({"hello" : "world", "foo" : "bar"})
    +    >>> r.hello
    +    'world'
    +    >>> r.foo
    +    'bar'
    +    """
    +
    +    def __init__(self, d):
    +        d.update(self.__dict__)
    +        self.__dict__ = d
    +        dict.__init__(self, d)
    +
    +class SchemaRDD(RDD):
    +    """
    +    An RDD of Row objects that has an associated schema. The underlying JVM object is a SchemaRDD,
    --- End diff --
    
    You can do `L{Row}` to link to the Row type.
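
    For instance, the class docstring could open with (a sketch of the suggested change):

    ```
    from pyspark.rdd import RDD

    class SchemaRDD(RDD):
        """
        An RDD of L{Row} objects that has an associated schema.
        """
    ```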



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40437823
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40161889
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14030/



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by ahirreddy <gi...@git.apache.org>.
Github user ahirreddy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11561162
  
    --- Diff: python/pyspark/context.py ---
    @@ -460,6 +463,225 @@ def sparkUser(self):
             """
             return self._jsc.sc().sparkUser()
     
    +class SQLContext:
    --- End diff --
    
    Moved



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40167681
  
    Merged build started. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40169163
  
    Merged build finished. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11554525
  
    --- Diff: python/run-tests ---
    @@ -56,6 +56,9 @@ run_test "pyspark/mllib/clustering.py"
     run_test "pyspark/mllib/recommendation.py"
     run_test "pyspark/mllib/regression.py"
     
    +# Remove the metastore directory created by the HiveContext tests in SparkSQL
    +rm -r metastore
    --- End diff --
    
    Also, do we remove "warehouse"?



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40447100
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14133/



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40432704
  
    Merged build finished. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11554170
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -235,6 +287,27 @@ JavaSchemaRDD teenagers = sqlCtx.sql("SELECT name FROM parquetFile WHERE age >=
     
     </div>
     
    +<div data-lang="python"  markdown="1">
    +
    +{% highlight python %}
    +
    +peopleTable # The SchemaRDD from the previous example.
    +
    +# JavaSchemaRDDs can be saved as parquet files, maintaining the schema information.
    --- End diff --
    
    They're just called SchemaRDDs, not JavaSchemaRDDs



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40401432
  
    Jenkins, retest this please.



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40401529
  
     Merged build triggered. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40046633
  
     Merged build triggered. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40003327
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40251820
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14063/



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40173673
  
    Merged build started. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40160083
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40437824
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14125/



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-39996864
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40447099
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40162091
  
    Merged build started. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40162076
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40051898
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13990/



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11554341
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -1387,6 +1387,95 @@ def _jrdd(self):
         def _is_pipelinable(self):
             return not (self.is_cached or self.is_checkpointed)
     
    +class Row(dict):
    +    """
    +    An extended L{dict} that takes a L{dict} in its constructor, and exposes those items as fields.
    +
    +    >>> r = Row({"hello" : "world", "foo" : "bar"})
    +    >>> r.hello
    +    'world'
    +    >>> r.foo
    +    'bar'
    +    """
    +
    +    def __init__(self, d):
    +        d.update(self.__dict__)
    +        self.__dict__ = d
    +        dict.__init__(self, d)
    --- End diff --
    
    Is this a standard way to do this in Python? Just wanted to make sure. It seems weird that we're calling __init__ on a dict that's already initialized.
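
    For comparison, a common alternative pattern (a sketch, not the PR's code) forwards attribute lookups to the dict instead of replacing `__dict__`:

    ```
    class Row(dict):
        # Sketch of a conventional attribute-access dict: __getattr__ is only
        # consulted when normal attribute lookup fails, so dict methods and
        # internals are unaffected.
        def __getattr__(self, name):
            try:
                return self[name]
            except KeyError:
                raise AttributeError(name)

    r = Row({"hello": "world", "foo": "bar"})
    assert r.hello == "world"
    assert r["foo"] == "bar"
    ```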



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by ahirreddy <gi...@git.apache.org>.
Github user ahirreddy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11561160
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -1387,6 +1387,95 @@ def _jrdd(self):
         def _is_pipelinable(self):
             return not (self.is_cached or self.is_checkpointed)
     
    +class Row(dict):
    +    """
    +    An extended L{dict} that takes a L{dict} in its constructor, and exposes those items as fields.
    +
    +    >>> r = Row({"hello" : "world", "foo" : "bar"})
    +    >>> r.hello
    +    'world'
    +    >>> r.foo
    +    'bar'
    +    """
    +
    +    def __init__(self, d):
    +        d.update(self.__dict__)
    +        self.__dict__ = d
    +        dict.__init__(self, d)
    +
    +class SchemaRDD(RDD):
    +    """
    +    An RDD of Row objects that has an associated schema. The underlying JVM object is a SchemaRDD,
    +    not a PythonRDD, so we can utilize the relational query api exposed by SparkSQL.
    +
    +    For normal L{RDD} operations (map, count, etc.) the L{SchemaRDD} is not operated on directly, as
    +    its underlying implementation is an RDD composed of Java objects. Instead it is converted to a
    +    PythonRDD in the JVM, on which Python operations can be done.
    +    """
    +
    +    def __init__(self, jschema_rdd, sql_ctx):
    +        self.sql_ctx = sql_ctx
    +        self._sc = sql_ctx._sc
    +        self._jschema_rdd = jschema_rdd
    +
    +        self.is_cached = False
    +        self.is_checkpointed = False
    +        self.ctx = self.sql_ctx._sc
    +        self._jrdd_deserializer = self.ctx.serializer
    +
    +    @property
    +    def _jrdd(self):
    +        """
    +        Lazy evaluation of PythonRDD object. Only done when a user calls methods defined by the
    +        L{RDD} super class (map, count, etc.).
    +        """
    +        return self.toPython()._jrdd
    +
    +    @property
    +    def _id(self):
    +        return self._jrdd.id()
    +
    +    def saveAsParquetFile(self, path):
    +        """
    +        Saves the contents of this L{SchemaRDD} as a parquet file, preserving the schema.  Files
    +        that are written out using this method can be read back in as a SchemaRDD using the
    +        L{SQLContext.parquetFile} method.
    +
    +        >>> from pyspark.context import SQLContext
    +        >>> sqlCtx = SQLContext(sc)
    +        >>> rdd = sc.parallelize([{"field1" : 1, "field2" : "row1"},
    +        ... {"field1" : 2, "field2": "row2"}, {"field1" : 3, "field2": "row3"}])
    +        >>> srdd = sqlCtx.inferSchema(rdd)
    +        >>> srdd.saveAsParquetFile("/tmp/test.parquet")
    +        >>> srdd2 = sqlCtx.parquetFile("/tmp/test.parquet")
    +        >>> srdd2.collect() == srdd.collect()
    +        True
    +        """
    +        self._jschema_rdd.saveAsParquetFile(path)
    +
    +    def registerAsTable(self, name):
    +        """
    +        Registers this RDD as a temporary table using the given name.  The lifetime of this temporary
    +        table is tied to the L{SQLContext} that was used to create this SchemaRDD.
    +
    +        >>> from pyspark.context import SQLContext
    +        >>> sqlCtx = SQLContext(sc)
    +        >>> rdd = sc.parallelize([{"field1" : 1, "field2" : "row1"},
    +        ... {"field1" : 2, "field2": "row2"}, {"field1" : 3, "field2": "row3"}])
    +        >>> srdd = sqlCtx.inferSchema(rdd)
    +        >>> srdd.registerAsTable("test")
    +        >>> srdd2 = sqlCtx.sql("select * from test")
    +        >>> srdd.collect() == srdd2.collect()
    +        True
    +        """
    +        self._jschema_rdd.registerAsTable(name)
    +
    +    def toPython(self):
    --- End diff --
    
    Made private



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40249409
  
    Merged build started. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40249487
  
    Merged build finished. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40159817
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40439427
  
    Regarding the longer test time, we should make sure that we aren't just comparing to times when the Hive tests weren't running at all.
    
    Should definitely look into the increased verbosity of the logs (even though that might not have been caused by this PR, but by turning the Hive tests back on). It is possible that we should just add more packages to `sql/hive/src/main/resources/log4j.properties`.



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40440452
  
    @marmbrus I see: the duration issue was just that we had stopped running the Hive tests for a bit after Aaron's build change.



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40177915
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40289938
  
    Merged build started. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40142612
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14014/



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40432076
  
    Merged build started. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11554379
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -1387,6 +1387,95 @@ def _jrdd(self):
         def _is_pipelinable(self):
             return not (self.is_cached or self.is_checkpointed)
     
    +class Row(dict):
    +    """
    +    An extended L{dict} that takes a L{dict} in its constructor, and exposes those items as fields.
    +
    +    >>> r = Row({"hello" : "world", "foo" : "bar"})
    +    >>> r.hello
    +    'world'
    +    >>> r.foo
    +    'bar'
    +    """
    +
    +    def __init__(self, d):
    +        d.update(self.__dict__)
    +        self.__dict__ = d
    +        dict.__init__(self, d)
    +
    +class SchemaRDD(RDD):
    +    """
    +    An RDD of Row objects that has an associated schema. The underlying JVM object is a SchemaRDD,
    +    not a PythonRDD, so we can utilize the relational query api exposed by SparkSQL.
    +
    +    For normal L{RDD} operations (map, count, etc.) the L{SchemaRDD} is not operated on directly, as
    --- End diff --
    
    This will become `L{pyspark.rdd.RDD}` if you move SchemaRDD to a sql module.



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-39992046
  
    Merged build started. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40046637
  
    Merged build started. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11554303
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -1387,6 +1387,95 @@ def _jrdd(self):
         def _is_pipelinable(self):
             return not (self.is_cached or self.is_checkpointed)
     
    +class Row(dict):
    +    """
    +    An extended L{dict} that takes a L{dict} in its constructor, and exposes those items as fields.
    +
    +    >>> r = Row({"hello" : "world", "foo" : "bar"})
    +    >>> r.hello
    +    'world'
    +    >>> r.foo
    +    'bar'
    +    """
    +
    +    def __init__(self, d):
    +        d.update(self.__dict__)
    +        self.__dict__ = d
    +        dict.__init__(self, d)
    +
    +class SchemaRDD(RDD):
    --- End diff --
    
    Can you move this to a separate pyspark/sql.py module instead of keeping it in the rdd one? Same with Row. No need to make rdd.py grow this big.
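
    A sketch of the suggested split; the file path is from the comment, and the classes are assumed to move over unchanged:

    ```
    # python/pyspark/sql.py
    from pyspark.rdd import RDD

    class Row(dict):
        ...  # moved verbatim from rdd.py

    class SchemaRDD(RDD):
        ...  # moved verbatim from rdd.py
    ```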



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by ahirreddy <gi...@git.apache.org>.
Github user ahirreddy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11560881
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -1387,6 +1387,95 @@ def _jrdd(self):
         def _is_pipelinable(self):
             return not (self.is_cached or self.is_checkpointed)
     
    +class Row(dict):
    +    """
    +    An extended L{dict} that takes a L{dict} in its constructor, and exposes those items as fields.
    +
    +    >>> r = Row({"hello" : "world", "foo" : "bar"})
    +    >>> r.hello
    +    'world'
    +    >>> r.foo
    +    'bar'
    +    """
    +
    +    def __init__(self, d):
    +        d.update(self.__dict__)
    +        self.__dict__ = d
    +        dict.__init__(self, d)
    +
    +class SchemaRDD(RDD):
    +    """
    +    An RDD of Row objects that has an associated schema. The underlying JVM object is a SchemaRDD,
    +    not a PythonRDD, so we can utilize the relational query api exposed by SparkSQL.
    +
    +    For normal L{RDD} operations (map, count, etc.) the L{SchemaRDD} is not operated on directly, as
    +    its underlying implementation is an RDD composed of Java objects. Instead it is converted to a
    +    PythonRDD in the JVM, on which Python operations can be done.
    +    """
    +
    +    def __init__(self, jschema_rdd, sql_ctx):
    +        self.sql_ctx = sql_ctx
    +        self._sc = sql_ctx._sc
    +        self._jschema_rdd = jschema_rdd
    +
    +        self.is_cached = False
    +        self.is_checkpointed = False
    --- End diff --
    
    I can do that. One question: when a user caches the RDD, do we cache both the underlying SchemaRDD and the PythonRDD (that's used in map, count, etc.)? Right now only the PythonRDD is cached/persisted/checkpointed when a user calls any of those methods.
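
    A minimal sketch of that delegation (hypothetical; assumes the JVM-side SchemaRDD inherits `cache` from Scala's RDD):

    ```
    def cache(self):
        # Persist the JVM-side SchemaRDD so that relational queries and
        # Python-side RDD operations (map, count, etc.) both reuse the
        # cached data, instead of caching only the derived PythonRDD.
        self.is_cached = True
        self._jschema_rdd.cache()
        return self
    ```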



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40432159
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14123/



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11554502
  
    --- Diff: python/run-tests ---
    @@ -56,6 +56,9 @@ run_test "pyspark/mllib/clustering.py"
     run_test "pyspark/mllib/recommendation.py"
     run_test "pyspark/mllib/regression.py"
     
    +# Remove the metastore directory created by the HiveContext tests in SparkSQL
    +rm -r metastore
    --- End diff --
    
    Actually you also want to do this at the beginning rather than at the end. Look at where we remove unit-tests.log



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40241301
  
    There already is a JIRA :)
    
    https://issues.apache.org/jira/browse/SPARK-1374



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40432063
  
     Merged build triggered. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40249489
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14061/



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40051897
  
    Merged build finished. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-39922963
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by ahirreddy <gi...@git.apache.org>.
Github user ahirreddy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11561164
  
    --- Diff: python/run-tests ---
    @@ -56,6 +56,9 @@ run_test "pyspark/mllib/clustering.py"
     run_test "pyspark/mllib/recommendation.py"
     run_test "pyspark/mllib/regression.py"
     
    +# Remove the metastore directory created by the HiveContext tests in SparkSQL
    +rm -r metastore
    --- End diff --
    
    Moved to top, and also removed warehouse



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11554195
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -318,4 +391,24 @@ Row[] results = hiveCtx.hql("FROM src SELECT key, value").collect();
     
     </div>
     
    +<div data-lang="python"  markdown="1">
    +
    +When working with Hive one must construct a `HiveContext`, which inherits from `SQLContext`, and
    +adds support for finding tables in the MetaStore and writing queries using HiveQL. In addition to
    +the `sql` method, a `HiveContext` also provides an `hql` method, which allows queries to be
    +expressed in HiveQL.
    --- End diff --
    
    I'm not sure if this is said earlier in the doc, but you should say how to build Spark for Hive support.



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-39924752
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13922/



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40239515
  
    @ahirreddy could you make a spark JIRA for this? Seems like a large enough feature that we'd want to track it. Add the component as SQL.



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40439431
  
    Merged build started. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40167675
  
     Merged build triggered. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40133497
  
     Merged build triggered. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11554158
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -176,6 +202,32 @@ List<String> teenagerNames = teenagers.map(new Function<Row, String>() {
     
     </div>
     
    +<div data-lang="python"  markdown="1">
    +
    +One type of table that is supported by Spark SQL is an RDD of dictionaries.  The keys of the
    +dictionary define the column names of the table, and the types are inferred by looking at the first
    +row. Any RDD of dictionaries can be converted to a SchemaRDD and then registered as a table.  Tables
    +can be used in subsequent SQL statements.
    +
    +{% highlight python %}
    +# Load a text file and convert each line to a dictionary.
    +lines = sc.textFile("examples/src/main/resources/people.txt")
    +parts = lines.map(lambda l: l.split(","))
    +people = parts.map(lambda p: {"name": p[0], "age": int(p[1])})
    +
    +# Infer the schema, and register the SchemaRDD as a table.
    +peopleTable = sqlCtx.inferSchema(people)
    +peopleTable.registerAsTable("people")
    +
    +# SQL can be run over SchemaRDDs that have been registered as a table.
    +teenagers = sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
    +
    +# The results of SQL queries are RDDs and support all the normal RDD operations.
    +teenNames = teenagers.map(lambda p: "Name: " + p.name)
    +{% endhighlight %}
    +
    --- End diff --
    
    Maybe add something saying that in future versions of PySpark, we'd like to support RDDs with other data types in registerAsTable too.



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40003329
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13955/



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40411346
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14114/



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40251723
  
     Merged build triggered. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11554448
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -1387,6 +1387,95 @@ def _jrdd(self):
         def _is_pipelinable(self):
             return not (self.is_cached or self.is_checkpointed)
     
    +class Row(dict):
    +    """
    +    An extended L{dict} that takes a L{dict} in its constructor, and exposes those items as fields.
    +
    +    >>> r = Row({"hello" : "world", "foo" : "bar"})
    +    >>> r.hello
    +    'world'
    +    >>> r.foo
    +    'bar'
    +    """
    +
    +    def __init__(self, d):
    +        d.update(self.__dict__)
    +        self.__dict__ = d
    +        dict.__init__(self, d)
    +
    +class SchemaRDD(RDD):
    +    """
    +    An RDD of Row objects that has an associated schema. The underlying JVM object is a SchemaRDD,
    +    not a PythonRDD, so we can utilize the relational query api exposed by SparkSQL.
    +
    +    For normal L{RDD} operations (map, count, etc.) the L{SchemaRDD} is not operated on directly, as
    +    its underlying implementation is an RDD composed of Java objects. Instead it is converted to a
    +    PythonRDD in the JVM, on which Python operations can be done.
    +    """
    +
    +    def __init__(self, jschema_rdd, sql_ctx):
    +        self.sql_ctx = sql_ctx
    +        self._sc = sql_ctx._sc
    +        self._jschema_rdd = jschema_rdd
    +
    +        self.is_cached = False
    +        self.is_checkpointed = False
    --- End diff --
    
    Why are you setting these to false, and do we want to implement cache() and checkpoint() here and call them on _jrdd?
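
    For illustration, a minimal sketch of that delegation, assuming the underlying JVM
    SchemaRDD exposes cache() and checkpoint() (the method bodies are assumptions, not the
    PR's actual code):

    ```
    def cache(self):
        # Record the Python-side flag and cache the JVM SchemaRDD itself, so the
        # relational plan is cached rather than a converted PythonRDD copy.
        self.is_cached = True
        self._jschema_rdd.cache()
        return self

    def checkpoint(self):
        # Likewise delegate checkpointing to the underlying JVM SchemaRDD.
        self.is_checkpointed = True
        self._jschema_rdd.checkpoint()
    ```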



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40432706
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14124/



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40300482
  
    Is this failing due to not cleaning up some files?



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40173668
  
     Merged build triggered. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-39996411
  
    Merged build started. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11554491
  
    --- Diff: python/run-tests ---
    @@ -56,6 +56,9 @@ run_test "pyspark/mllib/clustering.py"
     run_test "pyspark/mllib/recommendation.py"
     run_test "pyspark/mllib/regression.py"
     
    +# Remove the metastore directory created by the HiveContext tests in SparkSQL
    +rm -r metastore
    --- End diff --
    
    Probably want -f in there as well so that it's quiet if metastore doesn't exist.



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11425374
  
    --- Diff: python/pyspark/context.py ---
    @@ -460,6 +463,189 @@ def sparkUser(self):
             """
             return self._jsc.sc().sparkUser()
     
    +class SQLContext:
    +    """
    +    Main entry point for SparkSQL functionality. A SQLContext can be used to create L{SchemaRDD}s,
    +    register L{SchemaRDD}s as tables, execute SQL over tables, cache tables, and read parquet files.
    +    """
    +
    +    def __init__(self, sparkContext):
    +        """
    +        Create a new SQLContext.
    +
    +        @param sparkContext: The SparkContext to wrap.
    +
    +        >>> from pyspark.context import SQLContext
    +        >>> sqlCtx = SQLContext(sc)
    +
    +        >>> rdd = sc.parallelize([{"field1" : 1, "field2" : "row1"},
    +        ... {"field1" : 2, "field2": "row2"}, {"field1" : 3, "field2": "row3"}])
    +
    +        >>> srdd = sqlCtx.applySchema(rdd)
    +        >>> sqlCtx.applySchema(srdd) # doctest: +IGNORE_EXCEPTION_DETAIL
    +        Traceback (most recent call last):
    +            ...
    +        ValueError:...
    +
    +        >>> bad_rdd = sc.parallelize([1,2,3])
    +        >>> sqlCtx.applySchema(bad_rdd) # doctest: +IGNORE_EXCEPTION_DETAIL
    +        Traceback (most recent call last):
    +            ...
    +        ValueError:...
    +
    +        >>> allTypes = sc.parallelize([{"int" : 1, "string" : "string", "double" : 1.0, "long": 1L,
    +        ... "boolean" : True}])
    +        >>> srdd = sqlCtx.applySchema(allTypes).map(lambda x: (x.int, x.string, x.double, x.long,
    +        ... x.boolean))
    +        >>> srdd.collect()[0]
    +        (1, u'string', 1.0, 1, True)
    +        """
    +        self._sc = sparkContext
    +        self._jsc = self._sc._jsc
    +        self._jvm = self._sc._jvm
    +
    +    @property
    +    def _ssql_ctx(self):
    +        """
    +        Accessor for the JVM SparkSQL context.  Subclasses can override this property to provide
    +        their own JVM Contexts.
    +        """
    +        if not hasattr(self, '_scala_SQLContext'):
    +            self._scala_SQLContext = self._jvm.SQLContext(self._jsc.sc())
    +        return self._scala_SQLContext
    +
    +    def applySchema(self, rdd):
    --- End diff --
    
    I suggested _exactly_ the same thing offline! great minds....



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11554389
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -1387,6 +1387,95 @@ def _jrdd(self):
         def _is_pipelinable(self):
             return not (self.is_cached or self.is_checkpointed)
     
    +class Row(dict):
    +    """
    +    An extended L{dict} that takes a L{dict} in its constructor, and exposes those items as fields.
    +
    +    >>> r = Row({"hello" : "world", "foo" : "bar"})
    +    >>> r.hello
    +    'world'
    +    >>> r.foo
    +    'bar'
    +    """
    +
    +    def __init__(self, d):
    +        d.update(self.__dict__)
    +        self.__dict__ = d
    +        dict.__init__(self, d)
    +
    +class SchemaRDD(RDD):
    +    """
    +    An RDD of Row objects that has an associated schema. The underlying JVM object is a SchemaRDD,
    +    not a PythonRDD, so we can utilize the relational query api exposed by SparkSQL.
    --- End diff --
    
    I'm not 100% sure on this (CC @marmbrus) but are we spelling it Spark SQL or SparkSQL?



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11426010
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -284,6 +286,42 @@ private[spark] object PythonRDD {
         file.close()
       }
     
    +  def pythonToJava(pyRDD: JavaRDD[Array[Byte]]): JavaRDD[_] = {
    +    pyRDD.rdd.mapPartitions { iter =>
    +      val unpickle = new Unpickler
    +      // TODO: Figure out why flatMap is necessary for PySpark
    +      iter.flatMap { row =>
    +        unpickle.loads(row) match {
    +          case objs: java.util.ArrayList[Any] => objs
    +          // In case the partition doesn't have a collection
    +          case obj => Seq(obj)
    +        }
    +      }
    +    }
    +  }
    +
    +  def pythonToJavaMap(pyRDD: JavaRDD[Array[Byte]]): JavaRDD[Map[String, _]] = {
    +    pyRDD.rdd.mapPartitions { iter =>
    +      val unpickle = new Unpickler
    +      // TODO: Figure out why flatMap is necessary for PySpark
    +      iter.flatMap { row =>
    +        unpickle.loads(row) match {
    +          case objs: java.util.ArrayList[JMap[String, _]] => objs.map(_.toMap)
    --- End diff --
    
    Probably use `@unchecked` here:
     - `java.util.ArrayList[JMap[String, _] @unchecked]`
     - `JMap[String @unchecked, _] => Seq(obj.toMap)`



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40258191
  
    Hey Patrick, FYI, I want to review this a bit too since I've been doing some Python stuff lately.



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40411343
  
    Merged build finished. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40162077
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14031/



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11554183
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -235,6 +287,27 @@ JavaSchemaRDD teenagers = sqlCtx.sql("SELECT name FROM parquetFile WHERE age >=
     
     </div>
     
    +<div data-lang="python"  markdown="1">
    +
    +{% highlight python %}
    +
    +peopleTable # The SchemaRDD from the previous example.
    +
    +# JavaSchemaRDDs can be saved as parquet files, maintaining the schema information.
    +peopleTable.saveAsParquetFile("people.parquet")
    +
    +# Read in the parquet file created above.  Parquet files are self-describing so the schema is preserved.
    +# The result of loading a parquet file is also a JavaSchemaRDD.
    --- End diff --
    
    Ditto here



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40013385
  
    Also this should eventually support the insertInto we add in Java, to save something as a Hive table: https://issues.apache.org/jira/browse/SPARK-1424. But we can make that a separate pull request.



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40177916
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14050/



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40161888
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/363



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40251741
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by ahirreddy <gi...@git.apache.org>.
Github user ahirreddy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11560795
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -1387,6 +1387,95 @@ def _jrdd(self):
         def _is_pipelinable(self):
             return not (self.is_cached or self.is_checkpointed)
     
    +class Row(dict):
    +    """
    +    An extended L{dict} that takes a L{dict} in its constructor, and exposes those items as fields.
    +
    +    >>> r = Row({"hello" : "world", "foo" : "bar"})
    +    >>> r.hello
    +    'world'
    +    >>> r.foo
    +    'bar'
    +    """
    +
    +    def __init__(self, d):
    +        d.update(self.__dict__)
    +        self.__dict__ = d
    +        dict.__init__(self, d)
    +
    +class SchemaRDD(RDD):
    +    """
    +    An RDD of Row objects that has an associated schema. The underlying JVM object is a SchemaRDD,
    +    not a PythonRDD, so we can utilize the relational query api exposed by SparkSQL.
    +
    +    For normal L{RDD} operations (map, count, etc.) the L{SchemaRDD} is not operated on directly, as
    +    its underlying implementation is an RDD composed of Java objects. Instead it is converted to a
    +    PythonRDD in the JVM, on which Python operations can be done.
    +    """
    +
    +    def __init__(self, jschema_rdd, sql_ctx):
    +        self.sql_ctx = sql_ctx
    +        self._sc = sql_ctx._sc
    +        self._jschema_rdd = jschema_rdd
    +
    +        self.is_cached = False
    +        self.is_checkpointed = False
    +        self.ctx = self.sql_ctx._sc
    +        self._jrdd_deserializer = self.ctx.serializer
    +
    +    @property
    +    def _jrdd(self):
    +        """
    +        Lazy evaluation of PythonRDD object. Only done when a user calls methods defined by the
    +        L{RDD} super class (map, count, etc.).
    +        """
    +        return self.toPython()._jrdd
    --- End diff --
    
    It gets recomputed every time. I'll add some code so that the first time it's accessed the value is stored, and subsequent calls won't recompute it.
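
    A minimal sketch of that memoization, analogous to a lazy val in Scala (the _lazy_jrdd
    attribute name is an assumption, not the PR's actual code):

    ```
    @property
    def _jrdd(self):
        # Convert to a PythonRDD on first access only, then reuse the result.
        if not hasattr(self, '_lazy_jrdd'):
            self._lazy_jrdd = self.toPython()._jrdd
        return self._lazy_jrdd
    ```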



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40160071
  
     Merged build triggered. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11554591
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -1387,6 +1387,95 @@ def _jrdd(self):
         def _is_pipelinable(self):
             return not (self.is_cached or self.is_checkpointed)
     
    +class Row(dict):
    +    """
    +    An extended L{dict} that takes a L{dict} in its constructor, and exposes those items as fields.
    +
    +    >>> r = Row({"hello" : "world", "foo" : "bar"})
    +    >>> r.hello
    +    'world'
    +    >>> r.foo
    +    'bar'
    +    """
    +
    +    def __init__(self, d):
    +        d.update(self.__dict__)
    +        self.__dict__ = d
    +        dict.__init__(self, d)
    +
    +class SchemaRDD(RDD):
    +    """
    +    An RDD of Row objects that has an associated schema. The underlying JVM object is a SchemaRDD,
    +    not a PythonRDD, so we can utilize the relational query api exposed by SparkSQL.
    +
    +    For normal L{RDD} operations (map, count, etc.) the L{SchemaRDD} is not operated on directly, as
    +    its underlying implementation is an RDD composed of Java objects. Instead it is converted to a
    +    PythonRDD in the JVM, on which Python operations can be done.
    +    """
    +
    +    def __init__(self, jschema_rdd, sql_ctx):
    +        self.sql_ctx = sql_ctx
    +        self._sc = sql_ctx._sc
    +        self._jschema_rdd = jschema_rdd
    +
    +        self.is_cached = False
    +        self.is_checkpointed = False
    +        self.ctx = self.sql_ctx._sc
    +        self._jrdd_deserializer = self.ctx.serializer
    +
    +    @property
    +    def _jrdd(self):
    +        """
    +        Lazy evaluation of PythonRDD object. Only done when a user calls methods defined by the
    +        L{RDD} super class (map, count, etc.).
    +        """
    +        return self.toPython()._jrdd
    +
    +    @property
    +    def _id(self):
    +        return self._jrdd.id()
    +
    +    def saveAsParquetFile(self, path):
    +        """
    +        Saves the contents of this L{SchemaRDD} as a parquet file, preserving the schema.  Files
    +        that are written out using this method can be read back in as a SchemaRDD using the
    +        L{SQLContext.parquetFile} method.
    +
    +        >>> from pyspark.context import SQLContext
    +        >>> sqlCtx = SQLContext(sc)
    +        >>> rdd = sc.parallelize([{"field1" : 1, "field2" : "row1"},
    +        ... {"field1" : 2, "field2": "row2"}, {"field1" : 3, "field2": "row3"}])
    +        >>> srdd = sqlCtx.inferSchema(rdd)
    +        >>> srdd.saveAsParquetFile("/tmp/test.parquet")
    +        >>> srdd2 = sqlCtx.parquetFile("/tmp/test.parquet")
    +        >>> srdd2.collect() == srdd.collect()
    +        True
    +        """
    +        self._jschema_rdd.saveAsParquetFile(path)
    +
    +    def registerAsTable(self, name):
    +        """
    +        Registers this RDD as a temporary table using the given name.  The lifetime of this temporary
    +        table is tied to the L{SQLContext} that was used to create this SchemaRDD.
    +
    +        >>> from pyspark.context import SQLContext
    +        >>> sqlCtx = SQLContext(sc)
    +        >>> rdd = sc.parallelize([{"field1" : 1, "field2" : "row1"},
    +        ... {"field1" : 2, "field2": "row2"}, {"field1" : 3, "field2": "row3"}])
    +        >>> srdd = sqlCtx.inferSchema(rdd)
    +        >>> srdd.registerAsTable("test")
    +        >>> srdd2 = sqlCtx.sql("select * from test")
    +        >>> srdd.collect() == srdd2.collect()
    +        True
    +        """
    +        self._jschema_rdd.registerAsTable(name)
    +
    +    def toPython(self):
    --- End diff --
    
    Is this supposed to be a public method? From the examples it seems that you can call map and collect and such on the SchemaRDD itself. In that case this should be called _toPython to make it private, or you can inline it in the computation of _jrdd.



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40291396
  
    Merged build finished. 



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40439428
  
     Merged build triggered. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by ahirreddy <gi...@git.apache.org>.
Github user ahirreddy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11465774
  
    --- Diff: python/pyspark/java_gateway.py ---
    @@ -64,5 +64,9 @@ def run(self):
         java_import(gateway.jvm, "org.apache.spark.api.java.*")
         java_import(gateway.jvm, "org.apache.spark.api.python.*")
         java_import(gateway.jvm, "org.apache.spark.mllib.api.python.*")
    +    java_import(gateway.jvm, "org.apache.spark.sql.SQLContext")
    +    java_import(gateway.jvm, "org.apache.spark.sql.hive.HiveContext")
    +    java_import(gateway.jvm, "org.apache.spark.sql.hive.LocalHiveContext")
    +    java_import(gateway.jvm, "org.apache.spark.sql.hive.TestHiveContext")
    --- End diff --
    
    This will still work, but it will throw a non-fatal exception when the user tries to use a HiveContext without Hive built. I'll catch that and present a better error message indicating that the user needs to build Spark with Hive support by setting SPARK_HIVE=true.
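
    A minimal sketch of that error handling, assuming the failure surfaces as a py4j error
    (the exception type and message text are assumptions, not the PR's actual code):

    ```
    from py4j.protocol import Py4JError

    @property
    def _ssql_ctx(self):
        try:
            if not hasattr(self, '_scala_HiveContext'):
                self._scala_HiveContext = self._jvm.HiveContext(self._jsc.sc())
            return self._scala_HiveContext
        except Py4JError as e:
            raise Exception("You must build Spark with Hive support "
                            "(SPARK_HIVE=true) to use a HiveContext: %s" % e)
    ```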



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11554537
  
    --- Diff: python/pyspark/context.py ---
    @@ -460,6 +463,225 @@ def sparkUser(self):
             """
             return self._jsc.sc().sparkUser()
     
    +class SQLContext:
    --- End diff --
    
    Move this to a sql module as well



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by ahirreddy <gi...@git.apache.org>.
Github user ahirreddy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11560720
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -318,4 +391,24 @@ Row[] results = hiveCtx.hql("FROM src SELECT key, value").collect();
     
     </div>
     
    +<div data-lang="python"  markdown="1">
    +
    +When working with Hive one must construct a `HiveContext`, which inherits from `SQLContext`, and
    +adds support for finding tables in the MetaStore and writing queries using HiveQL. In addition to
    +the `sql` method a `HiveContext` also provides an `hql` method, which allows queries to be
    +expressed in HiveQL.
    --- End diff --
    
    It has its own section earlier in the doc, starting on line 338.



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40401538
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by ahirreddy <gi...@git.apache.org>.
Github user ahirreddy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11560776
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -1387,6 +1387,95 @@ def _jrdd(self):
         def _is_pipelinable(self):
             return not (self.is_cached or self.is_checkpointed)
     
    +class Row(dict):
    +    """
    +    An extended L{dict} that takes a L{dict} in its constructor, and exposes those items as fields.
    +
    +    >>> r = Row({"hello" : "world", "foo" : "bar"})
    +    >>> r.hello
    +    'world'
    +    >>> r.foo
    +    'bar'
    +    """
    +
    +    def __init__(self, d):
    +        d.update(self.__dict__)
    +        self.__dict__ = d
    +        dict.__init__(self, d)
    --- End diff --
    
    In Python the first constructor (__init__) that's invoked is that of the subclass, and it needs to explicitly call the parent constructor if it wants to use it. So in this case, I need to explicitly call the dict type's __init__ to run through that constructor.
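
    A self-contained example of that behavior (illustrative only, not the PR's code):

    ```
    class Row(dict):
        def __init__(self, d):
            # dict.__init__ is not invoked automatically for a subclass; without
            # this explicit call the mapping itself would stay empty.
            dict.__init__(self, d)
            self.__dict__ = d  # also expose the keys as attributes

    r = Row({"hello": "world"})
    print(r["hello"])  # 'world' -- populated only by the explicit dict.__init__
    print(r.hello)     # 'world'
    ```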



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40249657
  
    @marmbrus ah my bad - @ahirreddy could you update the pull request then?



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40163786
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11424723
  
    --- Diff: python/pyspark/context.py ---
    @@ -460,6 +463,189 @@ def sparkUser(self):
             """
             return self._jsc.sc().sparkUser()
     
    +class SQLContext:
    +    """
    +    Main entry point for SparkSQL functionality. A SQLContext can be used to create L{SchemaRDD}s,
    +    register L{SchemaRDD}s as tables, execute SQL over tables, cache tables, and read parquet files.
    +    """
    +
    +    def __init__(self, sparkContext):
    +        """
    +        Create a new SQLContext.
    +
    +        @param sparkContext: The SparkContext to wrap.
    +
    +        >>> from pyspark.context import SQLContext
    +        >>> sqlCtx = SQLContext(sc)
    +
    +        >>> rdd = sc.parallelize([{"field1" : 1, "field2" : "row1"},
    +        ... {"field1" : 2, "field2": "row2"}, {"field1" : 3, "field2": "row3"}])
    +
    +        >>> srdd = sqlCtx.applySchema(rdd)
    +        >>> sqlCtx.applySchema(srdd) # doctest: +IGNORE_EXCEPTION_DETAIL
    +        Traceback (most recent call last):
    +            ...
    +        ValueError:...
    +
    +        >>> bad_rdd = sc.parallelize([1,2,3])
    +        >>> sqlCtx.applySchema(bad_rdd) # doctest: +IGNORE_EXCEPTION_DETAIL
    +        Traceback (most recent call last):
    +            ...
    +        ValueError:...
    +
    +        >>> allTypes = sc.parallelize([{"int" : 1, "string" : "string", "double" : 1.0, "long": 1L,
    +        ... "boolean" : True}])
    +        >>> srdd = sqlCtx.applySchema(allTypes).map(lambda x: (x.int, x.string, x.double, x.long,
    +        ... x.boolean))
    +        >>> srdd.collect()[0]
    +        (1, u'string', 1.0, 1, True)
    +        """
    +        self._sc = sparkContext
    +        self._jsc = self._sc._jsc
    +        self._jvm = self._sc._jvm
    +
    +    @property
    +    def _ssql_ctx(self):
    +        """
    +        Accessor for the JVM SparkSQL context.  Subclasses can override this property to provide
    +        their own JVM Contexts.
    +        """
    +        if not hasattr(self, '_scala_SQLContext'):
    +            self._scala_SQLContext = self._jvm.SQLContext(self._jsc.sc())
    +        return self._scala_SQLContext
    +
    +    def applySchema(self, rdd):
    --- End diff --
    
    How about `inferSchema`?



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40289149
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by ahirreddy <gi...@git.apache.org>.
Github user ahirreddy commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40432281
  
    This is a MIMA checker issue because we now include Hive in the assembly jar when building on Jenkins. See JIRA SPARK-1494 for more information.
    https://issues.apache.org/jira/browse/SPARK-1494



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40163787
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14034/



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-39996867
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13953/



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-39996385
  
     Merged build triggered. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40259367
  
    Hey Ahir, I made a pass through this and made some comments. Looks really good overall. One style comment too that you should apply globally -- leave two blank lines between top-level items in a `.py` file (classes, functions, etc). Items within a class can have just one blank line between them but it gets a bit crowded at the top level.
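
    For example, the spacing convention described above (illustrative names only):

    ```
    class SQLContext(object):
        def sql(self, query):
            pass

        def table(self, name):  # one blank line between items inside a class
            pass


    class SchemaRDD(object):  # two blank lines before each top-level definition
        pass
    ```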



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40292065
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14081/



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11554418
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -1387,6 +1387,95 @@ def _jrdd(self):
         def _is_pipelinable(self):
             return not (self.is_cached or self.is_checkpointed)
     
    +class Row(dict):
    +    """
    +    An extended L{dict} that takes a L{dict} in its constructor, and exposes those items as fields.
    +
    +    >>> r = Row({"hello" : "world", "foo" : "bar"})
    +    >>> r.hello
    +    'world'
    +    >>> r.foo
    +    'bar'
    +    """
    +
    +    def __init__(self, d):
    +        d.update(self.__dict__)
    +        self.__dict__ = d
    +        dict.__init__(self, d)
    +
    +class SchemaRDD(RDD):
    +    """
    +    An RDD of Row objects that has an associated schema. The underlying JVM object is a SchemaRDD,
    +    not a PythonRDD, so we can utilize the relational query api exposed by SparkSQL.
    +
    +    For normal L{RDD} operations (map, count, etc.) the L{SchemaRDD} is not operated on directly, as
    +    its underlying implementation is an RDD composed of Java objects. Instead it is converted to a
    +    PythonRDD in the JVM, on which Python operations can be done.
    +    """
    +
    +    def __init__(self, jschema_rdd, sql_ctx):
    +        self.sql_ctx = sql_ctx
    +        self._sc = sql_ctx._sc
    +        self._jschema_rdd = jschema_rdd
    +
    +        self.is_cached = False
    +        self.is_checkpointed = False
    +        self.ctx = self.sql_ctx._sc
    +        self._jrdd_deserializer = self.ctx.serializer
    +
    +    @property
    +    def _jrdd(self):
    +        """
    +        Lazy evaluation of PythonRDD object. Only done when a user calls methods defined by the
    +        L{RDD} super class (map, count, etc.).
    +        """
    +        return self.toPython()._jrdd
    --- End diff --
    
    Is this computed only once, like a lazy val in Scala, or does it get recomputed each time you access _jrdd?



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40289144
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40433872
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by ahirreddy <gi...@git.apache.org>.
Github user ahirreddy commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40452665
  
    Awesome, thanks!
    Sent from Mailbox for iPhone
    
    On Tue, Apr 15, 2014 at 12:16 AM, asfgit <no...@github.com> wrote:
    
    > Closed #363 via c99bcb7feaa761c5826f2e1d844d0502a3b79538.
    > ---
    > Reply to this email directly or view it on GitHub:
    > https://github.com/apache/spark/pull/363



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40241427
  
    Also, I think this PR is breaking Jenkins by leaving files around that are making RAT fail. I don't think it's the fault of this PR, but rather a problem with the RAT code.



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-39924750
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40249393
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40444360
  
     Merged build triggered. 



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40009824
  
    Hey Ahir, make sure you also update the Maven build to support this.



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/363#discussion_r11457416
  
    --- Diff: python/pyspark/java_gateway.py ---
    @@ -64,5 +64,9 @@ def run(self):
         java_import(gateway.jvm, "org.apache.spark.api.java.*")
         java_import(gateway.jvm, "org.apache.spark.api.python.*")
         java_import(gateway.jvm, "org.apache.spark.mllib.api.python.*")
    +    java_import(gateway.jvm, "org.apache.spark.sql.SQLContext")
    +    java_import(gateway.jvm, "org.apache.spark.sql.hive.HiveContext")
    +    java_import(gateway.jvm, "org.apache.spark.sql.hive.LocalHiveContext")
    +    java_import(gateway.jvm, "org.apache.spark.sql.hive.TestHiveContext")
    --- End diff --
    
    Will this work if users haven't built with Hive? Maybe we want to make the Hive support optional. Not sure what the best way to do so is.



[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-39922972
  
    Merged build started. 


---

[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40432661
  
    Merged build started. 


---

[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40142611
  
    Merged build finished. All automated tests passed.


---

[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40446008
  
    Hey guys, I looked through the code and tried this out, and it looks good to me. So if we can fix the test issues, I'd say it's ready to merge.


---

[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-39992026
  
    Merged build triggered.


---

[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40440626
  
    Merged build finished. 


---

[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40159811
  
    Merged build triggered.


---

[GitHub] spark pull request: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40133501
  
    Merged build started. 


---

[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40439254
  
    So, a few concerns about the test output. First, the tests took much longer than normal (it could just be a slow Jenkins worker); second, there was a bunch of stray output. I wonder if we are including a log4j file in the Hive assembly jar... we should look into this before merging. Jenkins, retest this please.
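    
    One quick way to check that -- a minimal sketch, assuming sys.argv[1] is
    the path to the built assembly jar -- is to list any bundled log4j files:
    
    ```
    # Minimal sketch: list log4j-related files packed into an assembly jar.
    # The jar path is an assumption; point it at your own build output.
    import sys
    import zipfile
    
    with zipfile.ZipFile(sys.argv[1]) as jar:
        hits = [name for name in jar.namelist() if "log4j" in name.lower()]
    
    print("\n".join(hits) if hits else "no log4j files found")
    ```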


---

[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40433881
  
    Merged build started. 


---

[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/363#issuecomment-40440638
  
    I manually cancelled this build since we'll need to retest.


---