Posted to reviews@spark.apache.org by FRosner <gi...@git.apache.org> on 2015/10/22 10:50:41 UTC

[GitHub] spark pull request: SPARK-11258 Remove quadratic runtime complexit...

GitHub user FRosner opened a pull request:

    https://github.com/apache/spark/pull/9222

    SPARK-11258 Remove quadratic runtime complexity for converting a Spark DataFrame into an R data.frame

    https://issues.apache.org/jira/browse/SPARK-11258
    
    I was not able to locate an existing unit test for this function, so I wrote one.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/FRosner/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9222.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9222
    
----
commit e8ed0bd81f1eafafb0fa04eadf6b744baf712e88
Author: Frank Rosner <fr...@fam-rosner.de>
Date:   2015-10-22T08:47:05Z

    [SPARK-11258][SPARKR] convert obsolete comment into a type annotation

commit 4be5df3647673fe3aab3787da1301eac76be409f
Author: Frank Rosner <fr...@fam-rosner.de>
Date:   2015-10-22T08:48:15Z

    [SPARK-11258][SPARKR] provide test for existing dfToCols function

commit e85d42cdad0a2c10a3c6587cb8767824d1d80224
Author: Frank Rosner <fr...@fam-rosner.de>
Date:   2015-10-22T08:48:47Z

    [SPARK-11258][SPARKR] convert row to column wise representation in linear time

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by Gerrrr <gi...@git.apache.org>.
Github user Gerrrr commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150659258
  
    The problem might be related to GC, as the current implementation of `dfToCols` allocates memory for both the map result and the array, while @FRosner's implementation only allocates memory for the array.




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150778337
  
     Merged build triggered.




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by FRosner <gi...@git.apache.org>.
Github user FRosner commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-151293332
  
    @felixcheung the runtime performance tests I conducted showed that the gain is minimal when enough RAM is available: the new version was consistently faster, but only by a few per cent. However, when we tried to load a huge Parquet file (70 GB uncompressed), we ran into problems with the old implementation. As @Gerrrr indicated, this might be related to the `map` followed by `.toArray`, which causes memory overhead. When it comes to collecting data to the driver, we should keep memory consumption small.
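For reference, a minimal sketch of the two allocation strategies under discussion. This is illustrative only, not the code from the PR: `Seq[Any]` stands in for Spark's `Row` so it runs without a SparkContext, and the names are made up.

```scala
object DfToColsSketch {
  // Old-style approach: each column is built via map, and the outer .toArray
  // materializes an extra intermediate collection on the driver.
  def dfToColsOld(rows: Array[Seq[Any]], numCols: Int): Array[Array[Any]] =
    (0 until numCols).map { colIdx =>
      rows.map(row => row(colIdx))
    }.toArray

  // New-style approach: preallocate the target arrays once and fill them in
  // place, so the only allocations are the result arrays themselves.
  def dfToColsNew(rows: Array[Seq[Any]], numCols: Int): Array[Array[Any]] = {
    val numRows = rows.length
    val colArray = Array.fill(numCols)(new Array[Any](numRows))
    for (rowIdx <- 0 until numRows; colIdx <- 0 until numCols)
      colArray(colIdx)(rowIdx) = rows(rowIdx)(colIdx)
    colArray
  }
}
```

Both versions produce the same column-wise result; the difference is purely in how much intermediate memory they allocate while building it.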




[GitHub] spark pull request: SPARK-11258 Remove quadratic runtime complexit...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150382796
  
    Do you have benchmark numbers for this change?




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150700586
  
    **[Test build #44262 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44262/consoleFull)** for PR 9222 at commit [`f857546`](https://github.com/apache/spark/commit/f85754663b864aa573337a60e67add370dc10fd0).




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150757273
  
    @FRosner, there is no test case dedicated to `dfToCols` in R right now. I am still reluctant to add a new test case in Scala solely for `dfToCols`; I'd like to leave the decision to others' comments.




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-151232905
  
    @FRosner I assume this is faster than the original implementation?




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by FRosner <gi...@git.apache.org>.
Github user FRosner commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150778118
  
    @sun-rui thanks for your comments. You should not hesitate to add unit tests for functions that are not tested yet. Tests help other developers (like me) change the code without the immediate fear of unforeseen consequences, and they also document the API. As this function is public, it should have a test.
    
    I will look into the build failure now, @shivaram. I was not able to run all tests locally because I ran into out-of-memory problems, even though I changed the Maven settings as indicated in the documentation.




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by FRosner <gi...@git.apache.org>.
Github user FRosner commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-151236579
  
    In my opinion we are good to go if there are no objections from your side.





[GitHub] spark pull request: SPARK-11258 Remove quadratic runtime complexit...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9222#discussion_r42883859
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala ---
    @@ -130,16 +130,18 @@ private[r] object SQLUtils {
       }
     
       def dfToCols(df: DataFrame): Array[Array[Any]] = {
    -    // localDF is Array[Row]
    -    val localDF = df.collect()
    +    val localDF: Array[Row] = df.collect()
         val numCols = df.columns.length
    +    val numRows = localDF.length
     
    -    // result is Array[Array[Any]]
    -    (0 until numCols).map { colIdx =>
    -      localDF.map { row =>
    -        row(colIdx)
    +    val colArray = new Array[Array[Any]](numCols)
    +    for (colNo <- 0 until numCols) {
    --- End diff --
    
    Using a while loop here instead of a for loop should also help with performance
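A while-loop variant of the fill step might look like the sketch below. Again `Seq[Any]` stands in for `Row` so it runs standalone, and the names are illustrative, not the PR's final code.

```scala
object WhileLoopSketch {
  // Nested while loops avoid the closure and Range overhead that Scala
  // for-comprehensions can incur, which may matter in hot driver-side loops.
  def dfToCols(rows: Array[Seq[Any]], numCols: Int): Array[Array[Any]] = {
    val numRows = rows.length
    val colArray = Array.fill(numCols)(new Array[Any](numRows))
    var rowIdx = 0
    while (rowIdx < numRows) {
      var colIdx = 0
      while (colIdx < numCols) {
        // Transpose in place: row-major input to column-major output.
        colArray(colIdx)(rowIdx) = rows(rowIdx)(colIdx)
        colIdx += 1
      }
      rowIdx += 1
    }
    colArray
  }
}
```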




[GitHub] spark pull request: SPARK-11258 Remove quadratic runtime complexit...

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150166755
  
    Instead of creating a new test suite in Scala, you can add a new test case in R, using `callJStatic` to invoke `dfToCols` on the Scala side.




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150778791
  
    **[Test build #44293 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44293/consoleFull)** for PR 9222 at commit [`7e34d30`](https://github.com/apache/spark/commit/7e34d30bd8949ba15179d48e9f85a9b01e328cdb).




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-151194783
  
    The class is already `private[r]` so it should be fine.
    
    @FRosner Is the change good to go or are you doing any more perf tests ?




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by FRosner <gi...@git.apache.org>.
Github user FRosner commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-151056747
  
    @sun-rui I am not sure how `callJStatic` works, but I imagine that making a method private would prevent it from being called, both from the test and from R. Why would you not want a unit test? Test-driven development is a good invention and helps developers (especially in OSS) trust the code and make changes much more easily.




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150788661
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150700188
  
     Merged build triggered.




[GitHub] spark pull request: SPARK-11258 Remove quadratic runtime complexit...

Posted by FRosner <gi...@git.apache.org>.
Github user FRosner commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150494201
  
    @felixcheung the ticket mentions two numbers (https://issues.apache.org/jira/browse/SPARK-11258#) in the description. For a small data frame it is already a 4x speedup. But the real problem is the quadratic complexity: you run into issues quickly as the number of columns grows. We were not even able to load one of our data frames (> 1 million rows, several hundred columns) with the old method, but it ran through with the new one.
    
    I can provide some more benchmarks later today.
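A rough way to compare the two variants' driver-side time locally might look like the sketch below. This is not the benchmark behind the ticket's numbers: `Seq[Any]` stands in for `Row`, the sizes are arbitrary, and a single un-warmed run on the JVM gives indicative numbers only.

```scala
object ColsBenchSketch {
  // Map-based variant with an intermediate collection.
  def dfToColsOld(rows: Array[Seq[Any]], numCols: Int): Array[Array[Any]] =
    (0 until numCols).map(colIdx => rows.map(row => row(colIdx))).toArray

  // Preallocated variant that fills the result arrays in place.
  def dfToColsNew(rows: Array[Seq[Any]], numCols: Int): Array[Array[Any]] = {
    val numRows = rows.length
    val out = Array.fill(numCols)(new Array[Any](numRows))
    for (r <- 0 until numRows; c <- 0 until numCols) out(c)(r) = rows(r)(c)
    out
  }

  // Time a block in milliseconds; one run, so treat the result as indicative.
  def timeMs[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    val numRows = 10000
    val numCols = 200
    val rows: Array[Seq[Any]] =
      Array.tabulate(numRows)(r => Seq.tabulate[Any](numCols)(c => r * numCols + c))
    val (_, oldMs) = timeMs(dfToColsOld(rows, numCols))
    val (_, newMs) = timeMs(dfToColsNew(rows, numCols))
    println(s"old: ${oldMs} ms, new: ${newMs} ms")
  }
}
```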




[GitHub] spark pull request: SPARK-11258 Remove quadratic runtime complexit...

Posted by FRosner <gi...@git.apache.org>.
Github user FRosner commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9222#discussion_r42734750
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala ---
    @@ -130,16 +130,17 @@ private[r] object SQLUtils {
       }
     
       def dfToCols(df: DataFrame): Array[Array[Any]] = {
    -    // localDF is Array[Row]
    -    val localDF = df.collect()
    +    val localDF: Array[Row] = df.collect()
         val numCols = df.columns.length
    +    val numRows = localDF.length
     
    -    // result is Array[Array[Any]]
    -    (0 until numCols).map { colIdx =>
    -      localDF.map { row =>
    -        row(colIdx)
    -      }
    -    }.toArray
    +    val colArray = new Array[Array[Any]](numCols)
    +    for (colNo <- 0 to (numCols - 1)) colArray.update(colNo, new Array[Any](numRows))
    --- End diff --
    
    Seems also plausible. Do you want me to change it?




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/9222




[GitHub] spark pull request: SPARK-11258 Remove quadratic runtime complexit...

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9222#discussion_r42733892
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala ---
    @@ -130,16 +130,17 @@ private[r] object SQLUtils {
       }
     
       def dfToCols(df: DataFrame): Array[Array[Any]] = {
    -    // localDF is Array[Row]
    -    val localDF = df.collect()
    +    val localDF: Array[Row] = df.collect()
         val numCols = df.columns.length
    +    val numRows = localDF.length
     
    -    // result is Array[Array[Any]]
    -    (0 until numCols).map { colIdx =>
    -      localDF.map { row =>
    -        row(colIdx)
    -      }
    -    }.toArray
    +    val colArray = new Array[Array[Any]](numCols)
    +    for (colNo <- 0 to (numCols - 1)) colArray.update(colNo, new Array[Any](numRows))
    --- End diff --
    
    What about the following code:
        for (colIdx <- 0 until numCols) {
          colArray(colIdx) = new Array[Any](numRows)
        }
        for (rowIdx <- 0 until numRows; colIdx <- 0 until numCols) {
          colArray(colIdx)(rowIdx) = localDF(rowIdx)(colIdx)
        }




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150778341
  
    Merged build started.




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150788665
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44293/
    Test PASSed.




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150803696
  
    @FRosner, `dfToCols` is not a public API; it is now only a helper function for SparkR. Could you add a private modifier to it?




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by FRosner <gi...@git.apache.org>.
Github user FRosner commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150778223
  
    @shivaram I added the missing license header. Sorry for missing this one.




[GitHub] spark pull request: SPARK-11258 Remove quadratic runtime complexit...

Posted by FRosner <gi...@git.apache.org>.
Github user FRosner commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150494394
  
    @sun-rui so there is an existing R test case for this method? In the end I only did a refactoring, so existing tests should not break.
    
    The idea of unit tests, however, is to test one thing. A unit test for a method should test exactly what that method does, not related concerns like serde. If serde breaks, you still want the test for this method to succeed, because the method itself is working. And since it is part of the public API, anyone using Spark can call it.




[GitHub] spark pull request: SPARK-11258 Remove quadratic runtime complexit...

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150377869
  
    `dfToCols` is meant to be called from the R side, so it makes sense to test it in R. A test case in R exercises not only the Scala logic but also the rest of the R-to-Scala bridge logic, including serde. There is no Scala test suite for SparkR at the moment; of course, we can add one if it is really needed.
    
    Test cases in R live under R/pkg/inst/tests.




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150700637
  
    **[Test build #44262 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44262/consoleFull)** for PR 9222 at commit [`f857546`](https://github.com/apache/spark/commit/f85754663b864aa573337a60e67add370dc10fd0).
     * This patch **fails RAT tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150788399
  
    **[Test build #44293 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44293/consoleFull)** for PR 9222 at commit [`7e34d30`](https://github.com/apache/spark/commit/7e34d30bd8949ba15179d48e9f85a9b01e328cdb).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150700209
  
    Merged build started.




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150700643
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44262/
    Test FAILed.




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-151305314
  
    Ok I see -- we used to build a Seq[Array[Any]] and then call toArray on it, whereas now we build an Array[Array[Any]] directly. Alright, change LGTM. Merging this




[GitHub] spark pull request: SPARK-11258 Remove quadratic runtime complexit...

Posted by FRosner <gi...@git.apache.org>.
Github user FRosner commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150567323
  
    @felixcheung it is strange. When I did some performance checks today, the runtime did not seem to be quadratic. Could that be because the map transformation in the original code is lazy and only executed when the data is actually accessed?
    
    I can tell you that the original implementation did not succeed on our file while mine did. I will investigate a bit more. Maybe someone else has an opinion?
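    One way to sanity-check that hypothesis is a micro-benchmark that explicitly forces both conversion shapes inside the timed region. The sketch below is illustrative only (names, sizes, and data are hypothetical, not from this PR); `timeIt` forces its body, so a lazy collection would be fully evaluated while the clock is running:
    
    ```scala
    // Hedged micro-benchmark sketch: map-based conversion vs. a
    // preallocating loop, on synthetic row data. Illustrative only.
    object ConversionTiming {
      // Runs `body`, prints the elapsed wall-clock time, returns the result.
      // Because `result` is bound here, any lazy collection returned by
      // `body` would be evaluated only when later accessed -- which is the
      // effect being asked about above.
      def timeIt[A](label: String)(body: => A): A = {
        val start = System.nanoTime()
        val result = body
        println(f"$label%s: ${(System.nanoTime() - start) / 1e6}%.1f ms")
        result
      }
    
      // Shape of the original code: one map over rows per column.
      def convMap(rows: Array[Array[Any]], numCols: Int): Array[Array[Any]] =
        (0 until numCols).map(c => rows.map(r => r(c))).toArray
    
      // Shape of the replacement: preallocate and fill each column array.
      def convLoop(rows: Array[Array[Any]], numCols: Int): Array[Array[Any]] = {
        val cols = new Array[Array[Any]](numCols)
        var c = 0
        while (c < numCols) {
          val col = new Array[Any](rows.length)
          var r = 0
          while (r < rows.length) { col(r) = rows(r)(c); r += 1 }
          cols(c) = col
          c += 1
        }
        cols
      }
    
      def main(args: Array[String]): Unit = {
        val rows = Array.fill(100000)(Array.fill[Any](20)(0))
        timeIt("map-based")(convMap(rows, 20))
        timeIt("preallocated")(convLoop(rows, 20))
      }
    }
    ```
    
    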




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by FRosner <gi...@git.apache.org>.
Github user FRosner commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9222#discussion_r42884105
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala ---
    @@ -130,16 +130,18 @@ private[r] object SQLUtils {
       }
     
       def dfToCols(df: DataFrame): Array[Array[Any]] = {
    -    // localDF is Array[Row]
    -    val localDF = df.collect()
    +    val localDF: Array[Row] = df.collect()
         val numCols = df.columns.length
    +    val numRows = localDF.length
     
    -    // result is Array[Array[Any]]
    -    (0 until numCols).map { colIdx =>
    -      localDF.map { row =>
    -        row(colIdx)
    +    val colArray = new Array[Array[Any]](numCols)
    +    for (colNo <- 0 until numCols) {
    --- End diff --
    
    Yeah, we might give this a try, but I don't think the loop itself is the problem but rather what happens inside map and the subsequent .toArray. I will investigate a bit more over the weekend.
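    For reference, the two shapes being compared in the diff can be sketched on plain arrays. `Row` is simplified to `Array[Any]` here so the snippet stands alone, and `dfToColsOld`/`dfToColsNew` are hypothetical names, not Spark API:
    
    ```scala
    // Sketch of the row-wise to column-wise conversion under discussion.
    object DfToColsSketch {
      // Old shape: build each column by mapping over all rows, then
      // convert the intermediate IndexedSeq of columns with toArray.
      def dfToColsOld(rows: Array[Array[Any]], numCols: Int): Array[Array[Any]] =
        (0 until numCols).map { colIdx =>
          rows.map(row => row(colIdx))
        }.toArray
    
      // New shape: preallocate the column arrays up front and fill them
      // with index-based loops, avoiding intermediate collections.
      def dfToColsNew(rows: Array[Array[Any]], numCols: Int): Array[Array[Any]] = {
        val numRows = rows.length
        val colArray = new Array[Array[Any]](numCols)
        var colNo = 0
        while (colNo < numCols) {
          val col = new Array[Any](numRows)
          var rowNo = 0
          while (rowNo < numRows) {
            col(rowNo) = rows(rowNo)(colNo)
            rowNo += 1
          }
          colArray(colNo) = col
          colNo += 1
        }
        colArray
      }
    }
    ```
    
    Both shapes touch each of the numRows * numCols cells exactly once; the difference lies in the intermediate collections the map-based version allocates along the way.
    
    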




[GitHub] spark pull request: SPARK-11258 Remove quadratic runtime complexit...

Posted by FRosner <gi...@git.apache.org>.
Github user FRosner commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150179591
  
    Thanks for the reply.
    
    > Instead of creating a new testsuite in Scala, you can add a new test case in R, using callJStatic to invoke "dfToCols" on the Scala side.
    
    What would be the advantage of that? I think it is good practice to have unit tests for all public methods, and using `callJStatic` seems like tackling this in a roundabout way.
    
    Where can I find the test cases in R?





[GitHub] spark pull request: SPARK-11258 Remove quadratic runtime complexit...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150153510
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150758802
  
    @FRosner, it seems that you are investigating why the performance of the old implementation does not behave as you expected. Whatever the investigation turns up, your new version does have better performance. If we can't find a better one than yours, let's have it merged first :)




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-151060081
  
    @FRosner, I am not opposed to a unit test :). My point is that it does not seem necessary to add a Scala unit test for dfToCols, which is now dedicated to SparkR and whose logic is simple. Adding an R test case is fine with me, but it is a plus rather than a necessity, as its functionality is already covered by the existing test cases for collect().
    
    As for the private modifier, I could be wrong, so forget it :)




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150700641
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: SPARK-11258 Converting a Spark DataFrame into ...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/9222#issuecomment-150699863
  
    Jenkins, ok to test

