Posted to reviews@spark.apache.org by patrick-nicholson <gi...@git.apache.org> on 2017/05/09 20:39:10 UTC

[GitHub] spark pull request #17926: [MINOR][SQL][PYSPARK] Allow user to specify numSl...

GitHub user patrick-nicholson opened a pull request:

    https://github.com/apache/spark/pull/17926

    [MINOR][SQL][PYSPARK] Allow user to specify numSlices in SparkSession.createDataFrame

    ## What changes were proposed in this pull request?
    
    In my experience, pushing `pandas.DataFrame`s to `pyspark.DataFrame`s very quickly runs up against size issues. These can usually be remedied by changing configuration parameters (e.g., `spark.rpc.message.maxSize`), but it is much more convenient to change the level of parallelization used during `RDD` creation. This option is available in `sparkContext.parallelize`. This pull request exposes it to `sparkSession.createDataFrame`.
    
    ## How was this patch tested?
    
    I have been using a patch implementing this change for a while. I'm only exposing a keyword argument used by an underlying function to the user.
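
To illustrate why the number of slices matters for message size, here is a minimal pure-Python sketch of dividing local rows into `numSlices` contiguous chunks. The helper `slice_rows` is hypothetical and only approximates the idea; it is not Spark's actual partitioning code, which may batch and serialize rows differently.

```python
def slice_rows(rows, num_slices):
    """Split rows into num_slices contiguous chunks of near-equal size.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    n = len(rows)
    # Chunk i covers indices [i * n // num_slices, (i + 1) * n // num_slices).
    return [rows[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]


rows = [[i, str(i)] for i in range(10)]
one_slice = slice_rows(rows, 1)
five_slices = slice_rows(rows, 5)

# More slices means fewer rows per chunk, so each chunk is smaller to ship.
largest_with_one = max(len(chunk) for chunk in one_slice)
largest_with_five = max(len(chunk) for chunk in five_slices)
```

With ten rows, a single slice carries all ten rows at once, while five slices carry two rows each, which is the effect the description relies on to stay under a message size limit.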


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/patrick-nicholson/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17926.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17926
    
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17926: [MINOR][SQL][PYSPARK] Allow user to specify numSlices in...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17926
  
    Can one of the admins verify this patch?


[GitHub] spark issue #17926: [MINOR][SQL][PYSPARK] Allow user to specify numSlices in...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17926
  
    **[Test build #76698 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76698/testReport)** for PR 17926 at commit [`c9a6348`](https://github.com/apache/spark/commit/c9a63483672c3cc2b5a89eab03f3b7fe63156e72).


[GitHub] spark issue #17926: [MINOR][SQL][PYSPARK] Allow user to specify numSlices in...

Posted by patrick-nicholson <gi...@git.apache.org>.
Github user patrick-nicholson commented on the issue:

    https://github.com/apache/spark/pull/17926
  
    > This seems to add functionality rather than being a trivial fix. I think we need a JIRA.
    
    It's up to you. All I'm doing is passing a keyword argument from one preexisting public method to another. I don't view that as adding functionality, but I am not the arbiter of such things.
    
    > I think this is a rather niche case and we can workaround by parallelizing outside.
    
    It has been a rather common case for me since I'm often working with `pandas.DataFrame`s of millions of rows with many columns of mixed types (where any numeric types are implicitly `numpy` types, rather than base Python types). It can be worked around outside of `createDataFrame` by manually performing the steps it runs internally:
    
    ```
    df = spark.createDataFrame(spark.sparkContext.parallelize([r.tolist() for r in pandas_df.to_records(index=False)], numSlices=5), schema=[str(_) for _ in pandas_df.columns])
    ```
    
    Again, I don't see the proposed change as adding any functionality, just exposing machinery already in place for distributing Python data to an `RDD` in a consistent way.
    
    > Also, this looks like it only applies when the data is not an RDD. I think it is confusing if a user sets this option and it does not work in some cases unless the user reads the documentation.
    
    Given that `RDD` and local data are necessarily different and that `createDataFrame` already has separate code paths for `RDD` and local Python data, I don't know how this can be avoided.


[GitHub] spark issue #17926: [MINOR][SQL][PYSPARK] Allow user to specify numSlices in...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17926
  
    Does this cause any incompatibility with existing code?


[GitHub] spark issue #17926: [MINOR][SQL][PYSPARK] Allow user to specify numSlices in...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17926
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76700/
    Test FAILed.


[GitHub] spark issue #17926: [MINOR][SQL][PYSPARK] Allow user to specify numSlices in...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17926
  
    Merged build finished. Test FAILed.


[GitHub] spark issue #17926: [MINOR][SQL][PYSPARK] Allow user to specify numSlices in...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/17926
  
    How about adding this workaround to the function description of `createDataFrame` for now? In the future, we can change the interface if more people need this.
    
    Thanks!


[GitHub] spark issue #17926: [MINOR][SQL][PYSPARK] Allow user to specify numSlices in...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17926
  
    I don't think so (this is Python ...), for both positional and keyword arguments. (If the new `numSlices` were added in the middle of the arguments, it would break positional calls, but this PR adds it at the end.)
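
The compatibility argument can be checked with a small sketch. The three functions below are hypothetical stand-ins, not the real `createDataFrame` signature; they only show that appending a defaulted parameter at the end leaves existing positional and keyword calls unchanged, whereas inserting it in the middle shifts positional arguments.

```python
def create_v1(data, schema=None, samplingRatio=None):
    """Stand-in for the signature before the change (assumed for illustration)."""
    return {"data": data, "schema": schema, "samplingRatio": samplingRatio}


def create_v2(data, schema=None, samplingRatio=None, numSlices=None):
    """Same signature with numSlices appended as the last parameter."""
    result = create_v1(data, schema, samplingRatio)
    result["numSlices"] = numSlices
    return result


def create_bad(data, numSlices=None, schema=None, samplingRatio=None):
    """Counterexample: numSlices inserted in the middle of the parameters."""
    return {"data": data, "schema": schema,
            "samplingRatio": samplingRatio, "numSlices": numSlices}


# Existing call sites, written against create_v1, keep their meaning with create_v2.
old_positional = create_v2([1, 2], ["a"], 0.5)
old_keyword = create_v2([1, 2], schema=["a"], samplingRatio=0.5)

# The same positional call against create_bad silently binds the schema to numSlices.
shifted = create_bad([1, 2], ["a"], 0.5)
```

Here `old_positional` and `old_keyword` are identical and leave `numSlices` as `None`, while `shifted` ends up with `numSlices` bound to `["a"]` and `schema` bound to `0.5`.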


[GitHub] spark issue #17926: [MINOR][SQL][PYSPARK] Allow user to specify numSlices in...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17926
  
    **[Test build #76701 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76701/testReport)** for PR 17926 at commit [`4a9d58d`](https://github.com/apache/spark/commit/4a9d58d03ed945da17823c85e22d787287c6c21e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #17926: [MINOR][SQL][PYSPARK] Allow user to specify numSlices in...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17926
  
    Merged build finished. Test FAILed.


[GitHub] spark issue #17926: [MINOR][SQL][PYSPARK] Allow user to specify numSlices in...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/17926
  
    FYI we added `numPartitions` in R - but that's primarily because we don't have `sc.parallelize`
    https://github.com/apache/spark/blob/master/R/pkg/R/SQLContext.R#L190



[GitHub] spark issue #17926: [MINOR][SQL][PYSPARK] Allow user to specify numSlices in...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17926
  
    **[Test build #76700 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76700/testReport)** for PR 17926 at commit [`6ef9fdd`](https://github.com/apache/spark/commit/6ef9fdd5455cc0c11c0d9de237b4f28cd80b1886).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #17926: [MINOR][SQL][PYSPARK] Allow user to specify numSlices in...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17926
  
    This seems to add functionality rather than being a trivial fix. I think we need a JIRA.


[GitHub] spark issue #17926: [MINOR][SQL][PYSPARK] Allow user to specify numSlices in...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17926
  
    **[Test build #76698 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76698/testReport)** for PR 17926 at commit [`c9a6348`](https://github.com/apache/spark/commit/c9a63483672c3cc2b5a89eab03f3b7fe63156e72).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #17926: [MINOR][SQL][PYSPARK] Allow user to specify numSlices in...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17926
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76698/
    Test FAILed.


[GitHub] spark issue #17926: [MINOR][SQL][PYSPARK] Allow user to specify numSlices in...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17926
  
    No, it is not virtually the same before/after (and also we need a regression test). So, it needs a JIRA - see http://spark.apache.org/contributing.html. Adding a parameter to `createDataFrame` adds functionality to `createDataFrame` that did not exist before in this API.
    
    As you said, this can be done in a single line like that, so you could just write a wrapper function for it on the application side in a few lines.
    
    ```python
    def createDataFrame(data, numSlices, **kwargs):
        return spark.createDataFrame(
            spark.sparkContext.parallelize(data, numSlices=numSlices), **kwargs)
    ```
    
    I am not sure it is worth adding this parameter. It looks like there is potential confusion for users, and the workaround looks easy.



[GitHub] spark issue #17926: [MINOR][SQL][PYSPARK] Allow user to specify numSlices in...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17926
  
    **[Test build #76700 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76700/testReport)** for PR 17926 at commit [`6ef9fdd`](https://github.com/apache/spark/commit/6ef9fdd5455cc0c11c0d9de237b4f28cd80b1886).


[GitHub] spark pull request #17926: [MINOR][SQL][PYSPARK] Allow user to specify numSl...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17926


[GitHub] spark issue #17926: [MINOR][SQL][PYSPARK] Allow user to specify numSlices in...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17926
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76701/
    Test PASSed.


[GitHub] spark issue #17926: [MINOR][SQL][PYSPARK] Allow user to specify numSlices in...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/17926
  
    ok to test


[GitHub] spark issue #17926: [MINOR][SQL][PYSPARK] Allow user to specify numSlices in...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17926
  
    **[Test build #76701 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76701/testReport)** for PR 17926 at commit [`4a9d58d`](https://github.com/apache/spark/commit/4a9d58d03ed945da17823c85e22d787287c6c21e).


[GitHub] spark issue #17926: [MINOR][SQL][PYSPARK] Allow user to specify numSlices in...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17926
  
    Merged build finished. Test PASSed.

