You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hama.apache.org by "Thomas Jungblut (Created) (JIRA)" <ji...@apache.org> on 2012/03/25 19:31:24 UTC

[jira] [Created] (HAMA-540) Create distributed sort BSP

Create distributed sort BSP
---------------------------

                 Key: HAMA-540
                 URL: https://issues.apache.org/jira/browse/HAMA-540
             Project: Hama
          Issue Type: New Feature
          Components: bsp, examples
            Reporter: Thomas Jungblut


For HAMA-535 we need some kind of sort framework, for various other tasks this could be as well practical.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HAMA-540) Create distributed sort BSP

Posted by "praveen sripati (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253367#comment-13253367 ] 

praveen sripati commented on HAMA-540:
--------------------------------------

{quote}
And BTW I think it is better to sort the smaller pieces after the partitioning in the same superstep and after the send of the sorted parts just merge them together.
{quote}

After sending the sorts parts, the merging and sorting has to be done again. Just merging won't be enough.
                
> Create distributed sort BSP
> ---------------------------
>
>                 Key: HAMA-540
>                 URL: https://issues.apache.org/jira/browse/HAMA-540
>             Project: Hama
>          Issue Type: New Feature
>          Components: bsp, examples
>            Reporter: Thomas Jungblut
>
> For HAMA-535 we need some kind of sort framework, for various other tasks this could be as well practical.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HAMA-540) Create distributed sort BSP

Posted by "Thomas Jungblut (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253178#comment-13253178 ] 

Thomas Jungblut commented on HAMA-540:
--------------------------------------

Nice paper thanks, Sampling Sort seems to be very efficient. I'm having a look at it later today, maybe I can start a rough prototype. 

                
> Create distributed sort BSP
> ---------------------------
>
>                 Key: HAMA-540
>                 URL: https://issues.apache.org/jira/browse/HAMA-540
>             Project: Hama
>          Issue Type: New Feature
>          Components: bsp, examples
>            Reporter: Thomas Jungblut
>
> For HAMA-535 we need some kind of sort framework, for various other tasks this could be as well practical.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HAMA-540) Create distributed sort BSP

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13414720#comment-13414720 ] 

Thomas Jungblut commented on HAMA-540:
--------------------------------------

I will generify my sampling sort and create a patch for it.
                
> Create distributed sort BSP
> ---------------------------
>
>                 Key: HAMA-540
>                 URL: https://issues.apache.org/jira/browse/HAMA-540
>             Project: Hama
>          Issue Type: New Feature
>          Components: bsp core, examples
>            Reporter: Thomas Jungblut
>            Assignee: Thomas Jungblut
>
> For HAMA-535 we need some kind of sort framework, for various other tasks this could be as well practical.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HAMA-540) Create distributed sort BSP

Posted by "praveen sripati (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13252584#comment-13252584 ] 

praveen sripati commented on HAMA-540:
--------------------------------------

What algorithm to use for sorting with BSP? (1) mentions about quick sort, merge sort and sampling sort with BSP. Sampling sort seems to be the easiest to implement :)

(1) - http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.39.870
                
> Create distributed sort BSP
> ---------------------------
>
>                 Key: HAMA-540
>                 URL: https://issues.apache.org/jira/browse/HAMA-540
>             Project: Hama
>          Issue Type: New Feature
>          Components: bsp, examples
>            Reporter: Thomas Jungblut
>
> For HAMA-535 we need some kind of sort framework, for various other tasks this could be as well practical.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Assigned] (HAMA-540) Create distributed sort BSP

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Jungblut reassigned HAMA-540:
------------------------------------

    Assignee: Thomas Jungblut
    
> Create distributed sort BSP
> ---------------------------
>
>                 Key: HAMA-540
>                 URL: https://issues.apache.org/jira/browse/HAMA-540
>             Project: Hama
>          Issue Type: New Feature
>          Components: bsp core, examples
>            Reporter: Thomas Jungblut
>            Assignee: Thomas Jungblut
>
> For HAMA-535 we need some kind of sort framework, for various other tasks this could be as well practical.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HAMA-540) Create distributed sort BSP

Posted by "praveen sripati (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253303#comment-13253303 ] 

praveen sripati commented on HAMA-540:
--------------------------------------

1) It would be good to have the initial (p-1) pivots evenly distributed over the input data. This way the processors take almost same time to complete the task. How are you planning to get them?

In the TeraSort#readPartitions looks like the first reducer # of lines are read from the input data. Not an efficient way.

http://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/terasort/TeraSort.java

2) Slightly less than double the size of the input data is passed between the processors in the sampling sort. If the input is 1TB, then ~2TB is transferred between the nodes. Not sure if this is OK? It would be nice to know how much data is transferred in merge sort and quick sort.

----

I am just getting a feel of Hama/BSP and thought of implementing sampling sort. Please do it. I will pick the next easy one :)
                
> Create distributed sort BSP
> ---------------------------
>
>                 Key: HAMA-540
>                 URL: https://issues.apache.org/jira/browse/HAMA-540
>             Project: Hama
>          Issue Type: New Feature
>          Components: bsp, examples
>            Reporter: Thomas Jungblut
>
> For HAMA-535 we need some kind of sort framework, for various other tasks this could be as well practical.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HAMA-540) Create distributed sort BSP

Posted by "Thomas Jungblut (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253302#comment-13253302 ] 

Thomas Jungblut commented on HAMA-540:
--------------------------------------

Here's my first prototype:
https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/bsp/SamplingSort.java

I am really astonished that this works- 

You see that the pivotting is a bit naive, because the distribution is totally not even (that -> mapping between the logs). 

{noformat}
12/04/13 13:39:19 INFO bsp.FileInputFormat: Total input paths to process : 1
12/04/13 13:39:19 INFO bsp.FileInputFormat: Total # of splits: 7
12/04/13 13:39:19 WARN bsp.BSPJobClient: No job jar file set.  User classes may not be found. See BSPJob#setJar(String) or check Your jar file.
12/04/13 13:39:20 INFO bsp.BSPJobClient: Running job: job_localrunner_0001
12/04/13 13:39:22 INFO bsp.LocalBSPRunner: Setting up a new barrier for 7 tasks!
local:6 -> 176
local:2 -> 133
local:5 -> 189
local:0 -> 113
local:3 -> 92
local:4 -> 29
local:1 -> 78
12/04/13 13:39:23 INFO bsp.BSPJobClient: Current supersteps number: 1
12/04/13 13:39:23 INFO bsp.BSPJobClient: The total number of supersteps: 1
from file:/tmp/hama-sampling-out/part-00000
-2145373038 -2135777393 -2127418941 -2127349118 -2116694526 -2112753401 -2111019858 -2109843938 -2109467658 
-1775154178 -1771096268 -1768609402 -1767599475 -1753155542 -1744884630 -1736545907 -1734220768 -1727656934 
-1727161209 -1724429198 -1712603905 -1711206669 -1693536736 
from file:/tmp/hama-sampling-out/part-00001
-1684778946 -1683715271 -1677988183 -1673772158 -1672941153 -1669199897 -1661791404 -1660526886 -1658572801 
-1579967204 -1577470192 -1569276585
<rest omitted>
{noformat}

However I very much doubt that the algorithm is faster than MapReduce. I think we can use the Quicksort class in Hadoop to further optimize, I used Java7's new Timsort in an Arrays.sort() because it is in-place. To get there, I have a huge collections overhead and RAM usage. 
But the idea of the algorithm is very cool.
                
> Create distributed sort BSP
> ---------------------------
>
>                 Key: HAMA-540
>                 URL: https://issues.apache.org/jira/browse/HAMA-540
>             Project: Hama
>          Issue Type: New Feature
>          Components: bsp, examples
>            Reporter: Thomas Jungblut
>
> For HAMA-535 we need some kind of sort framework, for various other tasks this could be as well practical.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HAMA-540) Create distributed sort BSP

Posted by "Thomas Jungblut (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253355#comment-13253355 ] 

Thomas Jungblut commented on HAMA-540:
--------------------------------------

And BTW I think it is better to sort the smaller pieces after the partitioning in the same superstep and after the send of the sorted parts just merge them together.
                
> Create distributed sort BSP
> ---------------------------
>
>                 Key: HAMA-540
>                 URL: https://issues.apache.org/jira/browse/HAMA-540
>             Project: Hama
>          Issue Type: New Feature
>          Components: bsp, examples
>            Reporter: Thomas Jungblut
>
> For HAMA-535 we need some kind of sort framework, for various other tasks this could be as well practical.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HAMA-540) Create distributed sort BSP

Posted by "Thomas Jungblut (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253372#comment-13253372 ] 

Thomas Jungblut commented on HAMA-540:
--------------------------------------

it's enough because you are merging n-sorted segments. And it works. 
                
> Create distributed sort BSP
> ---------------------------
>
>                 Key: HAMA-540
>                 URL: https://issues.apache.org/jira/browse/HAMA-540
>             Project: Hama
>          Issue Type: New Feature
>          Components: bsp, examples
>            Reporter: Thomas Jungblut
>
> For HAMA-535 we need some kind of sort framework, for various other tasks this could be as well practical.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HAMA-540) Create distributed sort BSP

Posted by "Thomas Jungblut (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253308#comment-13253308 ] 

Thomas Jungblut commented on HAMA-540:
--------------------------------------

bq.1) It would be good to have the initial (p-1) pivots evenly distributed over the input data. This way the processors take almost same time to complete the task. How are you planning to get them?

Yes that is the major point in this algorithm, I use the same naive way to get the pivots. There is this median in linear time algorithm from Tarjan and Rivest ( http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm . 
However noone is going to iterative over 1TB of data to determine a good pivot.
When you have big data the chances that the data is evenly distributed over the processes by picking the first n-elements is quite high.


bq. 2) Slightly less than double the size of the input data is passed between the processors in the sampling sort. If the input is 1TB, then ~2TB is transferred between the nodes. Not sure if this is OK? It would be nice to know how much data is transferred in merge sort and quick sort.

That is the key why I think this is not very efficient.

bq. I am just getting a feel of Hama/BSP and thought of implementing sampling sort. Please do it. I will pick the next easy one 

I'm sorry! You can also implement sampling sort, maybe you interpret the paper other than me. 
                
> Create distributed sort BSP
> ---------------------------
>
>                 Key: HAMA-540
>                 URL: https://issues.apache.org/jira/browse/HAMA-540
>             Project: Hama
>          Issue Type: New Feature
>          Components: bsp, examples
>            Reporter: Thomas Jungblut
>
> For HAMA-535 we need some kind of sort framework, for various other tasks this could be as well practical.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HAMA-540) Create distributed sort BSP

Posted by "praveen sripati (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253402#comment-13253402 ] 

praveen sripati commented on HAMA-540:
--------------------------------------

Deterministic Sorting and Randomized Median Finding on the BSP model - http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.25.8776
                
> Create distributed sort BSP
> ---------------------------
>
>                 Key: HAMA-540
>                 URL: https://issues.apache.org/jira/browse/HAMA-540
>             Project: Hama
>          Issue Type: New Feature
>          Components: bsp, examples
>            Reporter: Thomas Jungblut
>
> For HAMA-535 we need some kind of sort framework, for various other tasks this could be as well practical.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira