Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2016/02/24 15:20:26 UTC

[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/11347

    [SPARK-13233][SQL][WIP] Python Dataset (basic version)

    ## What changes were proposed in this pull request?
    
    This PR introduces a new API: Python Dataset. Conceptually it's a combination of Python DataFrame and Python RDD, supporting both typed operations (e.g. map, flatMap, filter) and untyped operations (e.g. select, sort). This is a simpler version of https://github.com/apache/spark/pull/11117, without the aggregate part.
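
    A rough usage sketch (illustrative only; `map`, `filter`, `applySchema` and `select` follow the diffs in this patch, but the exact behavior is still WIP):

        from pyspark.sql.types import IntegerType

        # df has columns (age, name), as in the existing doctests
        ds = df.map(lambda row: row.age)       # typed operation: records become plain ints
        ds = ds.filter(lambda age: age > 3)    # still typed
        ds = ds.applySchema(IntegerType())     # back to a row-based DataFrame
        ds.select("value").collect()           # untyped operations work again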
    
    
    ## How was this patch tested?
    
    New tests are added in pyspark/sql/tests.py.
    
    
    TODO:
    
    * add documentation
    * more tests
    * fix all corner cases


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark pydataset

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11347.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11347
    
----
commit ddfdbbdf9c052f078dd914dfa6ae54de6c633d46
Author: Wenchen Fan <we...@databricks.com>
Date:   2016-02-24T09:19:11Z

    tmp

commit fb0e7f497538390e631d16c9d27ef3c03e4e4b8a
Author: Wenchen Fan <we...@databricks.com>
Date:   2016-02-24T13:58:52Z

    python dataset

commit a073f831adfd7640c40f9455ddcc02b9667db9ad
Author: Wenchen Fan <we...@databricks.com>
Date:   2016-02-24T14:19:25Z

    update

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11347#discussion_r54321529
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -859,14 +905,20 @@ def filter(self, condition):
             [Row(age=5, name=u'Bob')]
             >>> df.where("age = 2").collect()
             [Row(age=2, name=u'Alice')]
    +
    +        >>> df.filter(lambda row: row.age > 3).collect()
    +        [Row(age=5, name=u'Bob')]
    +        >>> df.map(lambda row: row.age).filter(lambda age: age > 3).collect()
    --- End diff --
    
    After `map`, the schema is `struct<value: binary>` (the default one), but the record type is int.




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188647091
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188646971
  
    **[Test build #51926 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51926/consoleFull)** for PR 11347 at commit [`e549d48`](https://github.com/apache/spark/commit/e549d48b44199a339e908ec9a807a84cb7fb80a1).
     * This patch **fails from timeout after a configured wait of `250m`**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189204225
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11347#discussion_r54308608
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -257,53 +264,85 @@ def limit(self, num):
         @ignore_unicode_prefix
         @since(1.3)
         def take(self, num):
    -        """Returns the first ``num`` rows as a :class:`list` of :class:`Row`.
    +        """Returns the first ``num`` records as a :class:`list`.
     
             >>> df.take(2)
             [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
             """
    -        with SCCallSiteSync(self._sc) as css:
    -            port = self._sc._jvm.org.apache.spark.sql.execution.python.EvaluatePython.takeAndServe(
    -                self._jdf, num)
    -        return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
    +        return self.limit(num).collect()
     
         @ignore_unicode_prefix
    -    @since(1.3)
    -    def map(self, f):
    -        """ Returns a new :class:`RDD` by applying a the ``f`` function to each :class:`Row`.
    +    @since(2.0)
    +    def applySchema(self, schema=None):
    +        """Returns a new :class:`DataFrame` by appling the given schema, or infer the schema
    +        by all of the records if no schema is given.
     
    -        This is a shorthand for ``df.rdd.map()``.
    +        It is only allowed to apply schema for DataFrame which is returned by typed operations,
    +        e.g. map, flatMap, etc. And the record type of the schema-applied DataFrame will be row.
     
    -        >>> df.map(lambda p: p.name).collect()
    +        >>> ds = df.map(lambda row: row.name)
    +        >>> ds.collect()
             [u'Alice', u'Bob']
    +        >>> ds.schema
    +        StructType(List(StructField(value,BinaryType,false)))
    +        >>> ds2 = ds.applySchema(StringType())
    +        >>> ds2.collect()
    +        [Row(value=u'Alice'), Row(value=u'Bob')]
    +        >>> ds2.schema
    +        StructType(List(StructField(value,StringType,true)))
    +        >>> ds3 = ds.applySchema()
    +        >>> ds3.collect()
    +        [Row(value=u'Alice'), Row(value=u'Bob')]
    +        >>> ds3.schema
    +        StructType(List(StructField(value,StringType,true)))
    +        """
    +        msg = "Cannot apply schema to a DataFrame which is not returned by typed operations"
    +        raise RuntimeError(msg)
    --- End diff --
    
    Just use `Exception`.




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188573087
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51924/
    Test FAILed.




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11347#discussion_r54308425
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -257,53 +264,85 @@ def limit(self, num):
         @ignore_unicode_prefix
         @since(1.3)
         def take(self, num):
    -        """Returns the first ``num`` rows as a :class:`list` of :class:`Row`.
    +        """Returns the first ``num`` records as a :class:`list`.
     
             >>> df.take(2)
             [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
             """
    -        with SCCallSiteSync(self._sc) as css:
    -            port = self._sc._jvm.org.apache.spark.sql.execution.python.EvaluatePython.takeAndServe(
    -                self._jdf, num)
    -        return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
    +        return self.limit(num).collect()
    --- End diff --
    
    This requires scanning all the partitions, see SPARK-10731.
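
    For reference, the RDD side avoids the full scan by reading partitions incrementally and widening the range until enough rows are collected (roughly what `RDD.take` does). A simplified sketch of that idea, not the actual PySpark implementation; `read_partition` is a hypothetical callback:

        def incremental_take(read_partition, num_partitions, num):
            """Collect up to `num` records, scanning partitions incrementally."""
            taken = []
            start, step = 0, 1
            while len(taken) < num and start < num_partitions:
                end = min(start + step, num_partitions)
                for p in range(start, end):
                    taken.extend(read_partition(p))
                start = end
                step *= 4  # widen the scan when early partitions were too small
            return taken[:num]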




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188285751
  
    **[Test build #51880 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51880/consoleFull)** for PR 11347 at commit [`f29ed29`](https://github.com/apache/spark/commit/f29ed294434066a7806eb7b05b195c4aa356f78c).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class Dataset(object):`
      * `class PipelinedDataset(Dataset):`
      * `case class PythonMapPartitions(`
      * `case class PythonMapPartitions(`




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188640208
  
    **[Test build #51925 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51925/consoleFull)** for PR 11347 at commit [`e549d48`](https://github.com/apache/spark/commit/e549d48b44199a339e908ec9a807a84cb7fb80a1).
     * This patch **fails from timeout after a configured wait of `250m`**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188285760
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11347#discussion_r54324424
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
    @@ -1753,15 +1753,26 @@ class DataFrame private[sql](
        * Converts a JavaRDD to a PythonRDD.
        */
       protected[sql] def javaToPython: JavaRDD[Array[Byte]] = {
    -    val structType = schema  // capture it for closure
    -    val rdd = queryExecution.toRdd.map(EvaluatePython.toJava(_, structType))
    -    EvaluatePython.javaToPython(rdd)
    +    if (isOutputPickled) {
    +      queryExecution.toRdd.map(_.getBinary(0))
    --- End diff --
    
    I introduced this for aggregate at first.
    Think about `ds.map(...).groupByKey(key_func, key_schema).mapGroups(...)`: when we do the aggregate, we need to run the key function and append the key data to the input rows. In this case, the input rows are the Python data after `map`, so we need to store the Python data at the JVM side even without a schema, but the row count should stay correct.
    
    It's also clearer when we call `ds.map(...).count()`: users would expect the same row count after typed operations.
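
    A small illustration of the second point (hypothetical, using the API from this PR):

        # typed operation: each record is pickled into exactly one row on the JVM side
        ds = df.map(lambda row: row.age)
        # users expect a 1-to-1 typed operation to preserve the row count
        assert ds.count() == df.count()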




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189128461
  
    @cloud-fan can we break this into multiple patches? I find the size a bit hard to review.
    
    Also maybe let's not do the renaming for now, because it might make more sense in Python to have DataFrame be the main class, and Dataset just be an alias (there is no practical difference in Python), since Python users are more familiar with data frames.
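
    A minimal sketch of the alias idea (purely illustrative, not code from this PR):

        # DataFrame stays the main class; Dataset is just another name for it
        Dataset = DataFrame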





[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189510320
  
    In Scala, it's clear that DataFrame is Dataset[Row]; some functions work with a DataFrame, some may not, and the compiler can check the types.
    
    But in Python it's confusing to me: sometimes the record is a Row object, sometimes the record is just an arbitrary object (for example, an int). Especially when we create a new DataFrame, for example with `range()` or `text()`, will these return a DataFrame of Row or a DataFrame of int/string?
    
    Before this PR, it was clear that a Python DataFrame always has Rows with a known schema. `df.rdd` or `df.map` will return an RDD, which could have arbitrary objects in it. Would it make more sense to have Dataset replace RDD for DataFrame, rather than replace DataFrame itself?
    
    for example:
    
     df.rdd returns an RDD
     df.ds returns a Dataset
     df.map() returns a Dataset
     




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189083979
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52009/
    Test PASSed.




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189201668
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188277420
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-190043823
  
    **[Test build #52168 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52168/consoleFull)** for PR 11347 at commit [`9beffc6`](https://github.com/apache/spark/commit/9beffc69fc829b92413f54f89fc55bb59d2a2b56).




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189128246
  
    cc @davies too




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-190311762
  
    I think we already solved this problem in RDD; you could take that as an example.




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189098744
  
    **[Test build #52018 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52018/consoleFull)** for PR 11347 at commit [`60ea3d3`](https://github.com/apache/spark/commit/60ea3d3d675e172124d80ba7a98324f8670ad8f2).




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189122914
  
    cc @yhuai 




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11347#discussion_r54323393
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -859,14 +905,20 @@ def filter(self, condition):
             [Row(age=5, name=u'Bob')]
             >>> df.where("age = 2").collect()
             [Row(age=2, name=u'Alice')]
    +
    +        >>> df.filter(lambda row: row.age > 3).collect()
    +        [Row(age=5, name=u'Bob')]
    +        >>> df.map(lambda row: row.age).filter(lambda age: age > 3).collect()
    --- End diff --
    
    This is the confusing part: the schema doesn't match the record object type.




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-190493809
  
    Will send a simpler version that just uses RDD.




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188861420
  
    **[Test build #51975 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51975/consoleFull)** for PR 11347 at commit [`22ef406`](https://github.com/apache/spark/commit/22ef406f6b6166741bd89fc5314f6ad637a8e381).




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189161421
  
    **[Test build #52039 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52039/consoleFull)** for PR 11347 at commit [`c178beb`](https://github.com/apache/spark/commit/c178bebb4f820252edc38b972a01fdf5aca51c1a).




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188572138
  
    **[Test build #51924 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51924/consoleFull)** for PR 11347 at commit [`32a04de`](https://github.com/apache/spark/commit/32a04de9f772be4eb576485ec135eed9349073e5).




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11347#discussion_r54307414
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -79,11 +81,18 @@ def __init__(self, jdf, sql_ctx):
         @property
         @since(1.3)
         def rdd(self):
    -        """Returns the content as an :class:`pyspark.RDD` of :class:`Row`.
    +        """Returns the content as an :class:`pyspark.RDD` of :class:`Row` or custom object.
             """
             if self._lazy_rdd is None:
                 jrdd = self._jdf.javaToPython()
    -            self._lazy_rdd = RDD(jrdd, self.sql_ctx._sc, BatchedSerializer(PickleSerializer()))
    +            if self._jdf.isOutputPickled():
    +                # If the underlying java DataFrame's output is pickled, which means the query
    +                # engine don't know the real schema of the data and just keep the pickled binary
    +                # for each custom object(no batch).  So we need to use non-batched serializer here.
    +                deserializer = PickleSerializer()
    --- End diff --
    
    The overhead of PickleSerializer is pretty high; it will serialize the class for each row. Could you do some benchmarks to see the difference between non-batched and batched (both size and CPU time)?
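
    A rough micro-benchmark sketch of that comparison, using plain `pickle` as a stand-in for PySpark's serializers (absolute numbers will differ from the real `BatchedSerializer` path):

        import pickle
        import time

        rows = [{"age": i % 100, "name": "user%d" % i} for i in range(100000)]

        # non-batched: one pickle payload per record
        start = time.time()
        per_row = [pickle.dumps(r, protocol=2) for r in rows]
        t_row = time.time() - start

        # batched: pickle a list of records at a time (100 per batch here)
        start = time.time()
        batches = [pickle.dumps(rows[i:i + 100], protocol=2) for i in range(0, len(rows), 100)]
        t_batch = time.time() - start

        print("per-row: %.2fs, %d bytes" % (t_row, sum(len(p) for p in per_row)))
        print("batched: %.2fs, %d bytes" % (t_batch, sum(len(b) for b in batches)))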




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188276819
  
    **[Test build #51878 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51878/consoleFull)** for PR 11347 at commit [`a073f83`](https://github.com/apache/spark/commit/a073f831adfd7640c40f9455ddcc02b9667db9ad).




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11347#discussion_r54321647
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -859,14 +905,20 @@ def filter(self, condition):
             [Row(age=5, name=u'Bob')]
             >>> df.where("age = 2").collect()
             [Row(age=2, name=u'Alice')]
    +
    +        >>> df.filter(lambda row: row.age > 3).collect()
    +        [Row(age=5, name=u'Bob')]
    +        >>> df.map(lambda row: row.age).filter(lambda age: age > 3).collect()
    +        [5]
             """
             if isinstance(condition, basestring):
    -            jdf = self._jdf.filter(condition)
    +            return DataFrame(self._jdf.filter(condition), self.sql_ctx)
    --- End diff --
    
    A DataFrame always has a schema (we have a default one). The difference is: a DataFrame with the default schema has custom objects as records, while other DataFrames have Rows as records.




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188640712
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189201281
  
    **[Test build #52039 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52039/consoleFull)** for PR 11347 at commit [`c178beb`](https://github.com/apache/spark/commit/c178bebb4f820252edc38b972a01fdf5aca51c1a).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188862312
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-190085962
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188640714
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51925/
    Test FAILed.




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189118045
  
    **[Test build #52018 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52018/consoleFull)** for PR 11347 at commit [`60ea3d3`](https://github.com/apache/spark/commit/60ea3d3d675e172124d80ba7a98324f8670ad8f2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188277423
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51878/
    Test FAILed.




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189163052
  
    **[Test build #52041 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52041/consoleFull)** for PR 11347 at commit [`7da3ffc`](https://github.com/apache/spark/commit/7da3ffc73eebc53f695ea90bacf7c3c279b06b07).




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189201670
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52039/
    Test FAILed.




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189083779
  
    **[Test build #52009 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52009/consoleFull)** for PR 11347 at commit [`effac22`](https://github.com/apache/spark/commit/effac22589be5c0a265072375cb2f54943ac80c0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11347#discussion_r54308933
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -257,53 +264,85 @@ def limit(self, num):
         @ignore_unicode_prefix
         @since(1.3)
         def take(self, num):
    -        """Returns the first ``num`` rows as a :class:`list` of :class:`Row`.
    +        """Returns the first ``num`` records as a :class:`list`.
     
             >>> df.take(2)
             [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
             """
    -        with SCCallSiteSync(self._sc) as css:
    -            port = self._sc._jvm.org.apache.spark.sql.execution.python.EvaluatePython.takeAndServe(
    -                self._jdf, num)
    -        return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
    +        return self.limit(num).collect()
     
         @ignore_unicode_prefix
    -    @since(1.3)
    -    def map(self, f):
    -        """ Returns a new :class:`RDD` by applying a the ``f`` function to each :class:`Row`.
    +    @since(2.0)
    +    def applySchema(self, schema=None):
    +        """Returns a new :class:`DataFrame` by appling the given schema, or infer the schema
    +        by all of the records if no schema is given.
     
    -        This is a shorthand for ``df.rdd.map()``.
    +        It is only allowed to apply schema for DataFrame which is returned by typed operations,
    +        e.g. map, flatMap, etc. And the record type of the schema-applied DataFrame will be row.
     
    -        >>> df.map(lambda p: p.name).collect()
    +        >>> ds = df.map(lambda row: row.name)
    +        >>> ds.collect()
             [u'Alice', u'Bob']
    +        >>> ds.schema
    +        StructType(List(StructField(value,BinaryType,false)))
    +        >>> ds2 = ds.applySchema(StringType())
    +        >>> ds2.collect()
    +        [Row(value=u'Alice'), Row(value=u'Bob')]
    +        >>> ds2.schema
    +        StructType(List(StructField(value,StringType,true)))
    +        >>> ds3 = ds.applySchema()
    +        >>> ds3.collect()
    +        [Row(value=u'Alice'), Row(value=u'Bob')]
    +        >>> ds3.schema
    +        StructType(List(StructField(value,StringType,true)))
    +        """
    +        msg = "Cannot apply schema to a DataFrame which is not returned by typed operations"
    +        raise RuntimeError(msg)
    +
    +    @ignore_unicode_prefix
    +    @since(2.0)
    +    def mapPartitions(self, func):
    +        """Returns a new :class:`DataFrame` by applying the ``f`` function to each partition.
    +
    +        The schema of returned :class:`DataFrame` is a single binary field struct type, please
    +        call `applySchema` to set the corrected schema before apply structured operations, e.g.
    +        select, sort, groupBy, etc.
    +
    +        >>> def f(iterator):
    +        ...     return map(lambda i: 1, iterator)
    +        >>> df.mapPartitions(f).collect()
    +        [1, 1]
             """
    -        return self.rdd.map(f)
    +        return PipelinedDataFrame(self, func)
     
         @ignore_unicode_prefix
    -    @since(1.3)
    -    def flatMap(self, f):
    -        """ Returns a new :class:`RDD` by first applying the ``f`` function to each :class:`Row`,
    -        and then flattening the results.
    +    @since(2.0)
    --- End diff --
    
    This API was introduced in 1.3 but changed in 2.0. We should still keep the `since(1.3)` here and add a `.. versionchanged:: 2.0 xxx` note in the docstring.
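
    Roughly like this, taking `mapPartitions` from the diff above as an example (a sketch assuming the usual pyspark `@since` decorator and the Sphinx `versionchanged` directive; the exact wording is up to the author):

        @ignore_unicode_prefix
        @since(1.3)
        def mapPartitions(self, func):
            """Returns a new :class:`DataFrame` by applying the ``func`` function to each partition.

            .. versionchanged:: 2.0
               Returns a :class:`DataFrame` instead of an :class:`RDD`.
            """
            return PipelinedDataFrame(self, func)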




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189204227
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52041/
    Test FAILed.




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188573074
  
    **[Test build #51924 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51924/consoleFull)** for PR 11347 at commit [`32a04de`](https://github.com/apache/spark/commit/32a04de9f772be4eb576485ec135eed9349073e5).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-190085708
  
    **[Test build #52168 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52168/consoleFull)** for PR 11347 at commit [`9beffc6`](https://github.com/apache/spark/commit/9beffc69fc829b92413f54f89fc55bb59d2a2b56).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189321448
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189083976
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189320837
  
    **[Test build #52052 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52052/consoleFull)** for PR 11347 at commit [`adb2aa9`](https://github.com/apache/spark/commit/adb2aa91564903dae72b6cef010822b9c335993f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189054815
  
    **[Test build #52009 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52009/consoleFull)** for PR 11347 at commit [`effac22`](https://github.com/apache/spark/commit/effac22589be5c0a265072375cb2f54943ac80c0).




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189275014
  
    **[Test build #52052 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52052/consoleFull)** for PR 11347 at commit [`adb2aa9`](https://github.com/apache/spark/commit/adb2aa91564903dae72b6cef010822b9c335993f).




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188284926
  
    **[Test build #51880 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51880/consoleFull)** for PR 11347 at commit [`f29ed29`](https://github.com/apache/spark/commit/f29ed294434066a7806eb7b05b195c4aa356f78c).




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189495109
  
    @cloud-fan Could you update the PR description to say which approach is taken in this PR? Thanks!




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188862298
  
    **[Test build #51975 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51975/consoleFull)** for PR 11347 at commit [`22ef406`](https://github.com/apache/spark/commit/22ef406f6b6166741bd89fc5314f6ad637a8e381).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11347#discussion_r53944846
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/python.scala ---
    @@ -0,0 +1,34 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.catalyst.plans.logical
    --- End diff --
    
    I'm quite confused about where to put the Python plans.
    
    For the Scala ones, we have an `object.scala` under the `org.apache.spark.sql.catalyst.plans.logical` package which contains all encoder-related logical plans, and an `objects.scala` under the `org.apache.spark.sql.execution` package which contains all encoder-related physical plans.
    
    For the Python ones, we currently put all of them under the `org.apache.spark.sql.execution.python` package, including logical plans, physical plans, rules, etc.
    
    cc @rxin 




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11347#discussion_r54484539
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -1178,6 +1183,74 @@ def test_functions_broadcast(self):
             # planner should not crash without a join
             broadcast(df1)._jdf.queryExecution().executedPlan()
     
    +    def test_basic_typed_operations(self):
    --- End diff --
    
    Maybe it will be easier to reason about test failures if we break it into multiple `def`s?
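
    A minimal sketch of what such a split might look like (the `setUp` fixture, the shared `self.sqlCtx`, and the expected values are hypothetical, not code taken from this PR):

        def setUp(self):
            # Assumed fixture: a tiny two-row DataFrame, mirroring the doctest
            # DataFrame used in dataframe.py.
            self.df = self.sqlCtx.createDataFrame(
                [(2, u'Alice'), (5, u'Bob')], ['age', 'name'])

        def test_typed_map(self):
            self.assertEqual(self.df.map(lambda row: row.age).collect(), [2, 5])

        def test_typed_filter(self):
            rows = self.df.filter(lambda row: row.age > 3).collect()
            self.assertEqual([row.name for row in rows], [u'Bob'])

        def test_typed_flat_map(self):
            letters = self.df.flatMap(lambda row: list(row.name)).collect()
            self.assertEqual(sorted(set(letters)), sorted(set(u'AliceBob')))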




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-190225735
  
    Hi @davies
    I did a simple benchmark of the pickle serializer with 1 million (int, string) rows. The execution time only includes serializing the results on the Python side and sending the bytes to the JVM side. The result: the batched serializer is about 2.67 times faster and its output is about 30% smaller.
    
    I think this is a problem. We should find another way to overcome the issue that the binary data on the JVM side has the wrong row count, so that we can keep using the batched serializer.
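
    For reference, a rough stand-in for that benchmark using the plain `pickle` module (the real path would go through pyspark's `PickleSerializer`/`BatchedSerializer` and also ship the bytes to the JVM, which this sketch omits, so exact numbers will differ):

        import pickle
        import time
        from io import BytesIO

        rows = [(i, u'name-%d' % i) for i in range(1000000)]

        def measure(chunks):
            # Serialize each chunk as one pickle frame; report (seconds, bytes).
            buf, start = BytesIO(), time.time()
            for chunk in chunks:
                buf.write(pickle.dumps(chunk, protocol=2))
            return time.time() - start, buf.tell()

        # Un-batched: one frame per row (what this PR currently does).
        unbatched = measure(rows)
        # Batched: 1024 rows per frame (the style of the existing PySpark default).
        batched = measure([rows[i:i + 1024] for i in range(0, len(rows), 1024)])

        print("un-batched: %.2fs, %d bytes" % unbatched)
        print("batched:    %.2fs, %d bytes" % batched)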




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11347#discussion_r54310206
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -859,14 +905,20 @@ def filter(self, condition):
             [Row(age=5, name=u'Bob')]
             >>> df.where("age = 2").collect()
             [Row(age=2, name=u'Alice')]
    +
    +        >>> df.filter(lambda row: row.age > 3).collect()
    +        [Row(age=5, name=u'Bob')]
    +        >>> df.map(lambda row: row.age).filter(lambda age: age > 3).collect()
    +        [5]
             """
             if isinstance(condition, basestring):
    -            jdf = self._jdf.filter(condition)
    +            return DataFrame(self._jdf.filter(condition), self.sql_ctx)
    --- End diff --
    
    This DataFrame may or may not have a schema; should we only allow this on a typed DataFrame?
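
    A sketch of the dispatch in question (helper names such as `_mapPartitions` are placeholders, not the actual code in this PR):

        def filter(self, condition):
            if isinstance(condition, basestring):
                # Untyped path: the expression string is parsed and run in the JVM.
                return DataFrame(self._jdf.filter(condition), self.sql_ctx)
            elif isinstance(condition, Column):
                # Untyped path: Column expression, also run in the JVM.
                return DataFrame(self._jdf.filter(condition._jc), self.sql_ctx)
            elif callable(condition):
                # Typed path: the predicate runs on the Python side, row by row,
                # so the result only carries a real schema if the input is typed.
                return self._mapPartitions(
                    lambda iterator: (obj for obj in iterator if condition(obj)))
            else:
                raise TypeError("condition should be a string, Column or callable")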




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11347#discussion_r54310472
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -859,14 +905,20 @@ def filter(self, condition):
             [Row(age=5, name=u'Bob')]
             >>> df.where("age = 2").collect()
             [Row(age=2, name=u'Alice')]
    +
    +        >>> df.filter(lambda row: row.age > 3).collect()
    +        [Row(age=5, name=u'Bob')]
    +        >>> df.map(lambda row: row.age).filter(lambda age: age > 3).collect()
    --- End diff --
    
    What's the type of `df.map(lambda row: row.age)`? Is it a DataFrame of StructType(BinaryType) or of IntegerType?
    
    This looks confusing to me.




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188647094
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51926/
    Test FAILed.




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11347#discussion_r54309652
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
    @@ -1753,15 +1753,26 @@ class DataFrame private[sql](
        * Converts a JavaRDD to a PythonRDD.
        */
       protected[sql] def javaToPython: JavaRDD[Array[Byte]] = {
    -    val structType = schema  // capture it for closure
    -    val rdd = queryExecution.toRdd.map(EvaluatePython.toJava(_, structType))
    -    EvaluatePython.javaToPython(rdd)
    +    if (isOutputPickled) {
    +      queryExecution.toRdd.map(_.getBinary(0))
    --- End diff --
    
    The pickled binary can't be used in the JVM directly without unpickling; what's the purpose of having a non-batched pickled binary format?




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189321451
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52052/
    Test PASSed.




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-190085963
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52168/
    Test PASSed.




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189204045
  
    **[Test build #52041 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52041/consoleFull)** for PR 11347 at commit [`7da3ffc`](https://github.com/apache/spark/commit/7da3ffc73eebc53f695ea90bacf7c3c279b06b07).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11347#discussion_r54208704
  
    --- Diff: python/pyspark/sql/context.py ---
    @@ -48,28 +48,29 @@
     
     
     def _monkey_patch_RDD(sqlContext):
    -    def toDF(self, schema=None, sampleRatio=None):
    +    def toDS(self, schema=None, sampleRatio=None):
             """
    -        Converts current :class:`RDD` into a :class:`DataFrame`
    +        Converts current :class:`RDD` into a :class:`Dataset`
     
    -        This is a shorthand for ``sqlContext.createDataFrame(rdd, schema, sampleRatio)``
    +        This is a shorthand for ``sqlContext.createDataset(rdd, schema, sampleRatio)``
     
             :param schema: a StructType or list of names of columns
             :param samplingRatio: the sample ratio of rows used for inferring
    -        :return: a DataFrame
    +        :return: a Dataset
     
    -        >>> rdd.toDF().collect()
    +        >>> rdd.toDS().collect()
             [Row(name=u'Alice', age=1)]
             """
    -        return sqlContext.createDataFrame(self, schema, sampleRatio)
    +        return sqlContext.createDataset(self, schema, sampleRatio)
     
    -    RDD.toDF = toDF
    +    RDD.toDS = toDS
    +    RDD.toDF = toDS  # for compatibility
    --- End diff --
    
    is there a test for backward compatibility?
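
    A minimal sketch of such a compatibility test (the method name and the `self.sc` fixture are assumptions, following the style of pyspark/sql/tests.py):

        from pyspark.sql import Row

        def test_rdd_to_df_alias(self):
            # Hypothetical test: the toDF() compatibility alias should behave
            # exactly like the new toDS() entry point.
            rdd = self.sc.parallelize([Row(name=u'Alice', age=1)])
            self.assertEqual(rdd.toDF().collect(), rdd.toDS().collect())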





[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11347#discussion_r54328489
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
    @@ -1753,15 +1753,26 @@ class DataFrame private[sql](
        * Converts a JavaRDD to a PythonRDD.
        */
       protected[sql] def javaToPython: JavaRDD[Array[Byte]] = {
    -    val structType = schema  // capture it for closure
    -    val rdd = queryExecution.toRdd.map(EvaluatePython.toJava(_, structType))
    -    EvaluatePython.javaToPython(rdd)
    +    if (isOutputPickled) {
    +      queryExecution.toRdd.map(_.getBinary(0))
    --- End diff --
    
    It will be faster to do the count() in Python; then you don't need to pass all these rows into the JVM.
    
    Even for groupByKey(), if the key_func is a Python function, the rows will be deserialized in Python anyway.
    
    For the in-memory cache, the rows are packed in batches in the JVM, so it's not true that the rows need to be serialized separately in order to be processed in the JVM.
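
    To illustrate the first point, a count can be folded from per-partition counts on the Python side, without shipping every pickled row back into the JVM (a sketch with a hypothetical `df`):

        # One count per partition, computed in the Python workers.
        partition_counts = df.rdd.mapPartitions(lambda rows: [sum(1 for _ in rows)])
        total = partition_counts.sum()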




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188573083
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188572868
  
    **[Test build #51925 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51925/consoleFull)** for PR 11347 at commit [`e549d48`](https://github.com/apache/spark/commit/e549d48b44199a339e908ec9a807a84cb7fb80a1).




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11347#discussion_r54329264
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
    @@ -1753,15 +1753,26 @@ class DataFrame private[sql](
        * Converts a JavaRDD to a PythonRDD.
        */
       protected[sql] def javaToPython: JavaRDD[Array[Byte]] = {
    -    val structType = schema  // capture it for closure
    -    val rdd = queryExecution.toRdd.map(EvaluatePython.toJava(_, structType))
    -    EvaluatePython.javaToPython(rdd)
    +    if (isOutputPickled) {
    +      queryExecution.toRdd.map(_.getBinary(0))
    --- End diff --
    
    For `groupByKey`, we need the JVM to do the grouping for us, i.e. cluster by key and sort by key. If the grouped Dataset contains custom objects, we need to send them to the JVM and keep them as binary while doing the grouping. Since we need to append the key data to every input row, we should use an un-batched pickler here.




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188578685
  
    retest this please




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188285763
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51880/
    Test FAILed.




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188579211
  
    **[Test build #51926 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51926/consoleFull)** for PR 11347 at commit [`e549d48`](https://github.com/apache/spark/commit/e549d48b44199a339e908ec9a807a84cb7fb80a1).




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189118177
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan closed the pull request at:

    https://github.com/apache/spark/pull/11347




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189118179
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52018/
    Test PASSed.




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11347#discussion_r54308480
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -232,14 +241,12 @@ def count(self):
         @ignore_unicode_prefix
         @since(1.3)
         def collect(self):
    -        """Returns all the records as a list of :class:`Row`.
    +        """Returns all the records as a list.
     
             >>> df.collect()
             [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
             """
    -        with SCCallSiteSync(self._sc) as css:
    -            port = self._jdf.collectToPython()
    -        return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
    +        return self.rdd.collect()
    --- End diff --
    
    Does this work for the SQL UI?




[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188862315
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51975/
    Test FAILed.




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-189128860
  
    We should also add tests for the compatibility methods.





[GitHub] spark pull request: [SPARK-13233][SQL][WIP] Python Dataset (basic ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11347#issuecomment-188277413
  
    **[Test build #51878 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51878/consoleFull)** for PR 11347 at commit [`a073f83`](https://github.com/apache/spark/commit/a073f831adfd7640c40f9455ddcc02b9667db9ad).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-13233][SQL] Python Dataset (basic versi...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11347#discussion_r54208792
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/python.scala ---
    @@ -0,0 +1,34 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.catalyst.plans.logical
    --- End diff --
    
    `org.apache.spark.sql.execution.python` is probably ok, since there is no concept of Python in catalyst itself.


