Posted to reviews@spark.apache.org by ueshin <gi...@git.apache.org> on 2014/04/01 10:40:17 UTC

[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...

GitHub user ueshin opened a pull request:

    https://github.com/apache/spark/pull/283

    SPARK-1380: Add sort-merge based cogroup/joins.

    I've written cogroup/joins based on the 'Sort-Merge' algorithm.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ueshin/apache-spark issues/SPARK-1380

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/283.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #283
    
----
commit 1c8ba5a0d480f816a0c217618b40bb615474963d
Author: Takuya UESHIN <ue...@happy-camper.st>
Date:   2014-03-19T10:28:26Z

    Add sort-merge cogroup/joins.

commit 99751661fcc7632a0f82816bbaca07bf822d3663
Author: Takuya UESHIN <ue...@happy-camper.st>
Date:   2014-03-25T10:15:09Z

    Add Java APIs for sort-merge cogroup/joins.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/283#issuecomment-39182575
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/283#issuecomment-39182558
  
    Merged build triggered.



[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/283#issuecomment-39187101
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/283#issuecomment-62332367
  
    I'd suggest we close this issue for now and go to the JIRA to discuss whether the feature is needed and how high of a priority it is.



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/283




[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the pull request:

    https://github.com/apache/spark/pull/283#issuecomment-39421176
  
    @mridulm Thank you for your reply.
    
    There are two points I have to mention about memory:
    
    - Before the shuffle  
    If the data are already sorted, no extra memory is needed because no sort is required. If they are not sorted, merge join needs some memory to sort the data in each partition.
    - After the shuffle  
    While fetching data, merge join needs at most as much memory as hash join, and it needs nothing beyond that because it can produce output immediately from its input. Hash join needs additional memory to build a hash table.
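
For contrast with the "after shuffle" point above, here is a minimal Python sketch of a hash join. It is not Spark's implementation and not code from this pull request; the names `hash_join`, `build_side`, and `probe_side` are hypothetical, and rows are assumed to be `(key, value)` pairs. It only illustrates why a hash join holds one whole input in memory before it can emit anything.

```python
from collections import defaultdict


def hash_join(build_side, probe_side):
    """Inner-join two iterables of (key, value) pairs using a hash table.

    Illustrative sketch only: the whole build side is read into memory
    before the first output row can be produced.
    """
    # Build phase: consume the entire build side into a key -> values table.
    table = defaultdict(list)
    for key, value in build_side:
        table[key].append(value)

    # Probe phase: stream the other side and look up matches.
    for key, probe_value in probe_side:
        for build_value in table.get(key, ()):
            yield key, (build_value, probe_value)
```

Neither input needs to be sorted here, which is the trade-off: the extra memory for `table` buys freedom from any ordering requirement.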




[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...

Posted by nchammas <gi...@git.apache.org>.
Github user nchammas commented on the pull request:

    https://github.com/apache/spark/pull/283#issuecomment-56483665
  
    @pwendell @rxin @mateiz What is the status of this PR? It looks pretty substantial, but it hasn't been updated in a while.




[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/283#issuecomment-39187102
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13626/



[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the pull request:

    https://github.com/apache/spark/pull/283#issuecomment-39417683
  
    @rxin Thank you for your reply.
    
    There are some cases where merge join is useful as an optimization:
    
    1. If the data to be joined are already sorted by the join keys, merge join can be done more efficiently than hash join. In my test case, both algorithms ran at almost the same speed, but merge join was more scalable.
    2. A merge join over data sorted by the same keys can be pipelined (each output can be produced as soon as the matching input tuples arrive), even for an N-way join (N > 2). Hash join blocks while it builds a hash table before any output is produced.
    
    I think it is useful for users to be able to choose how to optimize their processing.
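
To illustrate the pipelining point in the comment above, here is a minimal Python sketch of a merge join over two inputs that are already sorted by key. It is not the code from this pull request; the function name `merge_join` and the `(key, value)` pair layout are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(left, right):
    """Inner-join two iterables of (key, value) pairs, each sorted by key.

    Illustrative sketch only: results are yielded lazily, and at most the
    right-hand rows for the current key are held in memory.
    """
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    lk, lg = next(left_groups, (None, None))
    rk, rg = next(right_groups, (None, None))

    # A group of None marks an exhausted input.
    while lg is not None and rg is not None:
        if lk < rk:
            # Left key has no match yet; advance the left input.
            lk, lg = next(left_groups, (None, None))
        elif lk > rk:
            # Right key has no match yet; advance the right input.
            rk, rg = next(right_groups, (None, None))
        else:
            # Keys match: buffer this key's right rows, then emit the pairs.
            right_values = [value for _, value in rg]
            for _, left_value in lg:
                for right_value in right_values:
                    yield lk, (left_value, right_value)
            lk, lg = next(left_groups, (None, None))
            rk, rg = next(right_groups, (None, None))
```

Because the function is a generator, the first joined row is available as soon as the first matching key has been read from both inputs. Note that the per-key buffer `right_values` grows with the number of rows sharing one key, so heavily skewed keys still cost memory in this sketch.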



[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/283#issuecomment-39245668
  
    Is there a specific use case you are trying to address that cannot be handled by the hash join?



[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/283#issuecomment-39286182
  
    I have not done a detailed review, but this looks pretty expensive in terms of memory.
    Is it making assumptions about lack of skew w.r.t. a key and the amount of data per partition (that it can be held entirely in memory)?
    It would be good to document the constraints of the solution.

