Posted to reviews@spark.apache.org by zsxwing <gi...@git.apache.org> on 2016/02/22 23:50:08 UTC

[GitHub] spark pull request: Fix the issue that Iterator.map().toSeq is not...

GitHub user zsxwing opened a pull request:

    https://github.com/apache/spark/pull/11313

    Fix the issue that Iterator.map().toSeq is not Serializable

    ## What changes were proposed in this pull request?
    
    `scala.collection.Iterator`'s methods (e.g., `map`, `filter`) return an `AbstractIterator`, which is not Serializable. For example:
    ```Scala
    scala> val iter = Array(1, 2, 3).iterator.map(_ + 1)
    iter: Iterator[Int] = non-empty iterator
    
    scala> println(iter.isInstanceOf[Serializable])
    false
    ```
    If we call something like `scala.collection.Iterator.map(...).toSeq`, it will return a `Stream` that contains a non-serializable `AbstractIterator` field, which makes the `Stream` itself non-serializable.
    
    This PR uses `toArray` instead of `toSeq` to fix this issue in `def createDataFrame(data: java.util.List[_], beanClass: Class[_]): DataFrame`.
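    
    A minimal sketch, not part of this patch, of how the difference between `toSeq` and `toArray` could be checked with plain Java serialization. The `SerializationCheck` object and its `isJavaSerializable` helper are hypothetical names introduced only for illustration. On Scala 2.11 the `toSeq` result is expected to fail the check; newer Scala versions changed what `Iterator.toSeq` returns, so the first line of output may differ there.
    
    ```scala
    import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}
    
    object SerializationCheck {
      // Hypothetical helper: returns true if Java serialization of `obj` succeeds,
      // false if some object in its graph is not Serializable.
      def isJavaSerializable(obj: AnyRef): Boolean = {
        val out = new ObjectOutputStream(new ByteArrayOutputStream())
        try {
          out.writeObject(obj)
          true
        } catch {
          case _: NotSerializableException => false
        } finally {
          out.close()
        }
      }
    
      def main(args: Array[String]): Unit = {
        // Result of toSeq on a mapped iterator (a lazy Stream on Scala 2.11).
        val asSeq = Array(1, 2, 3).iterator.map(_ + 1).toSeq
        // Result of toArray: a fully materialized Array[Int].
        val asArray = Array(1, 2, 3).iterator.map(_ + 1).toArray
    
        println(s"toSeq result serializable:   ${isJavaSerializable(asSeq)}")
        println(s"toArray result serializable: ${isJavaSerializable(asArray)}")
      }
    }
    ```
    
    The array holds only the already computed elements and keeps no reference to the iterator, which is why it should pass the check regardless of Scala version.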
    
    ## How was this patch tested?
    
    Jenkins tests.
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zsxwing/spark SPARK-13390

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11313.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11313
    
----
commit 07a88b5f45028c5460fee0ea679095e589feea6e
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2016-02-22T22:41:46Z

    Fix the issue that Iterator.map().toSeq is not Serializable

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13390][SQL]Fix the issue due to Iterato...

Posted by zsxwing <gi...@git.apache.org>.

Github user zsxwing commented on the pull request:

    https://github.com/apache/spark/pull/11313#issuecomment-187458974
  
    cc @holdenk @srowen 



[GitHub] spark pull request: [SPARK-13390][SQL]Fix the issue due to Iterato...

Posted by zsxwing <gi...@git.apache.org>.

Github user zsxwing commented on the pull request:

    https://github.com/apache/spark/pull/11313#issuecomment-187955679
  
    Just found this happened to be fixed in #10511. `rows` will be materialized in LocalTableScan.
    
    I'm going to resubmit this patch against branch-1.6 since this is not necessary for master.



[GitHub] spark pull request: [SPARK-13390][SQL]Fix the issue due to Iterato...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/11313#issuecomment-187462066
  
    Maybe good to add a test which would have failed before this change?



[GitHub] spark pull request: [SPARK-13390][SQL]Fix the issue due to Iterato...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11313#discussion_r53724620
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -579,7 +579,7 @@ class SQLContext private[sql](
         val className = beanClass.getName
         val beanInfo = Introspector.getBeanInfo(beanClass)
         val rows = SQLContext.beansToRows(data.asScala.iterator, beanInfo, attrSeq)
    -    DataFrame(self, LocalRelation(attrSeq, rows.toSeq))
    +    DataFrame(self, LocalRelation(attrSeq, rows.toArray))
    --- End diff --
    
    toArray could be kind of expensive for large local iterators - maybe we should only do this if necessary?



[GitHub] spark pull request: [SPARK-13390][SQL]Fix the issue due to Iterato...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11313#discussion_r53731184
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -579,7 +579,7 @@ class SQLContext private[sql](
         val className = beanClass.getName
         val beanInfo = Introspector.getBeanInfo(beanClass)
         val rows = SQLContext.beansToRows(data.asScala.iterator, beanInfo, attrSeq)
    -    DataFrame(self, LocalRelation(attrSeq, rows.toSeq))
    +    DataFrame(self, LocalRelation(attrSeq, rows.toArray))
    --- End diff --
    
    Yeah, I know we need to materialize anyways (toSeq forces that too on Iterators), but I meant more the cost of copying all the objects to an array (see https://github.com/scala/scala/blob/v2.11.1/src/library/scala/collection/TraversableOnce.scala#L269 ), so maybe only calling toArray if necessary would be worthwhile (but perhaps not).



[GitHub] spark pull request: [SPARK-13390][SQL]Fix the issue due to Iterato...

Posted by zsxwing <gi...@git.apache.org>.

Github user zsxwing commented on the pull request:

    https://github.com/apache/spark/pull/11313#issuecomment-187957084
  
    See #11334 instead.



[GitHub] spark pull request: [SPARK-13390][SQL]Fix the issue due to Iterato...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11313#discussion_r53726644
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -579,7 +579,7 @@ class SQLContext private[sql](
         val className = beanClass.getName
         val beanInfo = Introspector.getBeanInfo(beanClass)
         val rows = SQLContext.beansToRows(data.asScala.iterator, beanInfo, attrSeq)
    -    DataFrame(self, LocalRelation(attrSeq, rows.toSeq))
    +    DataFrame(self, LocalRelation(attrSeq, rows.toArray))
    --- End diff --
    
    I think you have to materialize it anyways in order to support multiple scans.
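    
    A small sketch of why multiple scans need materialized rows, using plain integers rather than Spark rows. The `MultipleScans` object and `scanTwice` method are hypothetical names used only for illustration; this is not code from the patch.
    
    ```scala
    object MultipleScans {
      // Scans the same materialized rows twice and returns both sums.
      def scanTwice(rows: Array[Int]): (Int, Int) = (rows.sum, rows.sum)
    
      def main(args: Array[String]): Unit = {
        // An iterator can only be traversed once: summing it consumes it.
        val iter = Array(1, 2, 3).iterator.map(_ + 1)
        println(iter.sum)      // first scan sees 2, 3, 4
        println(iter.hasNext)  // nothing is left for a second scan
    
        // A materialized array can be scanned as often as needed.
        val rows = Array(1, 2, 3).iterator.map(_ + 1).toArray
        println(scanTwice(rows))
      }
    }
    ```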



[GitHub] spark pull request: [SPARK-13390][SQL]Fix the issue due to Iterato...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11313#issuecomment-187451015
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51694/



[GitHub] spark pull request: [SPARK-13390][SQL]Fix the issue due to Iterato...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11313#issuecomment-187451012
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-13390][SQL]Fix the issue that Iterator....

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11313#issuecomment-187425079
  
    **[Test build #51694 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51694/consoleFull)** for PR 11313 at commit [`07a88b5`](https://github.com/apache/spark/commit/07a88b5f45028c5460fee0ea679095e589feea6e).



[GitHub] spark pull request: [SPARK-13390][SQL]Fix the issue due to Iterato...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11313#issuecomment-187450267
  
    **[Test build #51694 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51694/consoleFull)** for PR 11313 at commit [`07a88b5`](https://github.com/apache/spark/commit/07a88b5f45028c5460fee0ea679095e589feea6e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-13390][SQL]Fix the issue due to Iterato...

Posted by zsxwing <gi...@git.apache.org>.

Github user zsxwing closed the pull request at:

    https://github.com/apache/spark/pull/11313



[GitHub] spark pull request: [SPARK-13390][SQL]Fix the issue due to Iterato...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11313#discussion_r53755996
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -579,7 +579,7 @@ class SQLContext private[sql](
         val className = beanClass.getName
         val beanInfo = Introspector.getBeanInfo(beanClass)
         val rows = SQLContext.beansToRows(data.asScala.iterator, beanInfo, attrSeq)
    -    DataFrame(self, LocalRelation(attrSeq, rows.toSeq))
    +    DataFrame(self, LocalRelation(attrSeq, rows.toArray))
    --- End diff --
    
    I think it was the same before -- it was materialized into a `Seq`. I think Josh is suggesting this is inherently necessary before making a `LocalRelation` so there's no meaningful opportunity for lazy eval (?)  If so then this seems fine.

