Posted to dev@hama.apache.org by "Edward J. Yoon (Created) (JIRA)" <ji...@apache.org> on 2012/03/09 06:45:57 UTC

[jira] [Created] (HAMA-531) Data re-partitioning in BSPJobClient

Data re-partitioning in BSPJobClient
------------------------------------

                 Key: HAMA-531
                 URL: https://issues.apache.org/jira/browse/HAMA-531
             Project: Hama
          Issue Type: Improvement
            Reporter: Edward J. Yoon


Re-partitioning the data is a very expensive operation. Currently, we process the read/write operations sequentially using the HDFS API in BSPJobClient, on the client side. This can cause a "too many open files" error, adds HDFS overhead, and is slow.

We have to find another way to re-partition the data.
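
For illustration only (this is not the actual BSPJobClient code), the client-side pattern described above looks roughly like the sketch below: one SequenceFile writer is opened per partition and every record is streamed through on the submitting node, which is where both the open-file pressure and the sequential I/O come from. All class and method names beyond the Hadoop SequenceFile API are assumptions.

{code}
// Sketch only: illustrates the sequential, client-side re-partitioning pattern
// described above. It is NOT the actual BSPJobClient implementation.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ClientSideRepartitionSketch {

  /** Rewrites one input file into numPartitions output files on the client. */
  public static void repartition(Configuration conf, Path input, Path outDir,
      int numPartitions) throws IOException {
    FileSystem fs = FileSystem.get(conf);

    // One open writer per partition: with many partitions this is where the
    // "too many open files" pressure comes from.
    SequenceFile.Writer[] writers = new SequenceFile.Writer[numPartitions];
    for (int i = 0; i < numPartitions; i++) {
      writers[i] = SequenceFile.createWriter(fs, conf,
          new Path(outDir, "part-" + i), Text.class, Text.class);
    }

    // Every record is read and rewritten sequentially on the submitting node.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, input, conf);
    try {
      Text key = new Text();
      Text value = new Text();
      while (reader.next(key, value)) {
        int partition = (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        writers[partition].append(key, value);
      }
    } finally {
      reader.close();
      for (SequenceFile.Writer writer : writers) {
        writer.close();
      }
    }
  }
}
{code}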


[jira] [Commented] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280737#comment-13280737 ] 

Thomas Jungblut commented on HAMA-531:
--------------------------------------

I think we should delay this a bit. I'm going to commit this partitioning for graph algorithms now since this is our major feature.
                

[jira] [Updated] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Jungblut updated HAMA-531:
---------------------------------

    Attachment: HAMA-531_final.patch

Okay, it works now. Basically, it was because the PageRank input's adjacent edges were marked as double instead of null.

This broke the serialization.

Fixed, build is fine. I'd like to commit this tomorrow. 

However, we should think about how we build the pre-job partitioner.
                

[jira] [Commented] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "praveen sripati (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13279804#comment-13279804 ] 

praveen sripati commented on HAMA-531:
--------------------------------------

I haven't gathered any performance metrics, but partitioning in the BSPJobClient (on the node from which the BSP job is submitted) does not seem very efficient. Moving the data partitioning out of the BSPJobClient so that it is done in parallel should drastically cut the total processing time. So, I am interested in getting some discussion going around this JIRA.

In the JIRA two approaches have been mentioned.

1. Using BSP to partition the data.
2. Using MR to partition the data.

Using the MR approach

	- The data has to be read by the mappers (READ)
	- The output of the mapper has to be written to the file system (WRITE)
	- Reducers have to read the data back from the file system (READ)
	- Reducers process and write the data back to HDFS (WRITE)
	- The BSP Job reads the MR output (READ) and does the processing

So, there are 3 Reads and 2 Writes, before the data is actually processed by the BSP Job.

Using the BSP Job

	- The data is read by the BSP Task (READ)
	- BSP task checks which task the record belongs to using the partitioner and sends the message to the appropriate task.
	- Global Sync
	- The bsp tasks write data to HDFS (optional WRITE)
	- The various bsp tasks receive the message and start processing immediately.

So, there is only 1 Read.

Partitioning using BSP seems much faster than MR. The only advantage I see in the MR approach is that, since the partitioned data is written to disk, the same BSP job can be run multiple times without partitioning the data again. Of course, the BSP tasks could also write the partitioned data to HDFS to be processed later if required. I don't see any obvious advantage of the MR approach over the BSP approach.

Does anyone know how it is done in Giraph?
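
To make the BSP option concrete, below is a minimal sketch (not a patch) of such a partitioning superstep. It assumes the Hama BSPPeer API of this period (readNext/send/sync/getCurrentMessage/write), Text keys and values, and hash partitioning standing in for whatever Partitioner the job configures; packing a record into a single Text message is a simplification.

{code}
// Sketch only, not a patch: one partitioning superstep in Hama BSP.
// Each task reads its local split, routes every record to the owning peer,
// syncs once, and then processes or optionally persists its partition.
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

public class RepartitioningBSP extends BSP<Text, Text, Text, Text, Text> {

  @Override
  public void bsp(BSPPeer<Text, Text, Text, Text, Text> peer)
      throws IOException, SyncException, InterruptedException {
    Text key = new Text();
    Text value = new Text();

    // 1. READ: each task reads only its own input split.
    while (peer.readNext(key, value)) {
      // 2. Route the record to the peer that owns it (hash stands in for
      //    the job's configured partitioner).
      int partition = (key.hashCode() & Integer.MAX_VALUE) % peer.getNumPeers();
      // Packing key and value into one Text message is a simplification.
      peer.send(peer.getPeerName(partition), new Text(key + "\t" + value));
    }

    // 3. Global sync: after this barrier every peer holds its partition.
    peer.sync();

    // 4. Receive and process immediately, or optionally write to HDFS so the
    //    partitioned data can be reused by later jobs.
    Text msg;
    while ((msg = peer.getCurrentMessage()) != null) {
      String[] kv = msg.toString().split("\t", 2);
      peer.write(new Text(kv[0]), new Text(kv.length > 1 ? kv[1] : ""));
    }
  }
}
{code}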
                

[jira] [Commented] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281499#comment-13281499 ] 

Thomas Jungblut commented on HAMA-531:
--------------------------------------

That is really bad.
                

[jira] [Commented] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281497#comment-13281497 ] 

Thomas Jungblut commented on HAMA-531:
--------------------------------------

It seems that the SequenceFile was split right in the middle of a Text record. If you don't provide the task number, does it work as well?
                

[jira] [Assigned] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Jungblut reassigned HAMA-531:
------------------------------------

    Assignee:     (was: Thomas Jungblut)
    

[jira] [Updated] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward J. Yoon updated HAMA-531:
--------------------------------

             Priority: Critical  (was: Major)
    Affects Version/s: 0.5.0
        Fix Version/s: 0.6.0

This issue must be fixed.
                

[jira] [Commented] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281487#comment-13281487 ] 

Thomas Jungblut commented on HAMA-531:
--------------------------------------

Any task logs? Why does it work with 3 tasks and not with 4?
                

[jira] [Commented] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280133#comment-13280133 ] 

Thomas Jungblut commented on HAMA-531:
--------------------------------------

Two possible approaches:

We schedule a BSP job that writes to a given number of files,
OR we use the same logic as the graph repair: a first superstep reads everything and then distributes it among the tasks.

I think the latter solution is quite simple.

bq. Does anyone know how it is done in Giraph?

I don't know, but I'd bet on the second solution, since their mapper input isn't very likely to be partitioned.
                

[jira] [Updated] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Jungblut updated HAMA-531:
---------------------------------

    Attachment: HAMA-531_1.patch

Small patch hacked together.
                

[jira] [Commented] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280711#comment-13280711 ] 

Edward J. Yoon commented on HAMA-531:
-------------------------------------

In the BSP case, it looks like pre-partitioned data is needed.
                

[jira] [Commented] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281746#comment-13281746 ] 

Thomas Jungblut commented on HAMA-531:
--------------------------------------

Thanks Edward, you are right.
I've just observed it in the testcases:
{noformat}
12/05/23 19:41:37 INFO bsp.BSPJobClient: Running job: job_localrunner_0001
12/05/23 19:41:37 ERROR bsp.LocalBSPRunner: Exception during BSP execution!
java.io.IOException: org.apache.hama.graph.VertexWritable@78092b6f read 42 bytes, should read 49
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2129)
	at org.apache.hama.bsp.SequenceFileRecordReader.next(SequenceFileRecordReader.java:82)
	at org.apache.hama.bsp.TrackedRecordReader.moveToNext(TrackedRecordReader.java:60)
	at org.apache.hama.bsp.TrackedRecordReader.next(TrackedRecordReader.java:46)
	at org.apache.hama.bsp.BSPPeerImpl.readNext(BSPPeerImpl.java:495)
	at org.apache.hama.graph.GraphJobRunner.loadVertices(GraphJobRunner.java:395)
{noformat}

I'll fix this in HAMA-580.
                

[jira] [Updated] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Jungblut updated HAMA-531:
---------------------------------

    Attachment: HAMA-531_2.patch

Works quite okay, except that the edge weight serialization breaks things.

I don't know if it is related to NullWritable or if I have made another mistake; however, I will fix that soon. Currently this part is deactivated.
                

[jira] [Commented] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281494#comment-13281494 ] 

Edward J. Yoon commented on HAMA-531:
-------------------------------------

{code}
12/05/23 19:03:17 DEBUG graph.GraphJobRunner: Combiner class: org.apache.hama.examples.SSSP$MinIntCombiner
12/05/23 19:03:17 DEBUG graph.GraphJobRunner: vertex class: org.apache.hama.examples.SSSP$ShortestPathVertex
12/05/23 19:03:17 ERROR bsp.BSPTask: Error running bsp setup and bsp function.
java.io.IOException: org.apache.hadoop.io.Text read 31 bytes, should read 190
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2129)
	at org.apache.hama.bsp.SequenceFileRecordReader.next(SequenceFileRecordReader.java:82)
	at org.apache.hama.bsp.TrackedRecordReader.moveToNext(TrackedRecordReader.java:60)
	at org.apache.hama.bsp.TrackedRecordReader.next(TrackedRecordReader.java:46)
	at org.apache.hama.bsp.BSPPeerImpl.readNext(BSPPeerImpl.java:482)
	at org.apache.hama.graph.GraphJobRunner.loadVertices(GraphJobRunner.java:280)
	at org.apache.hama.graph.GraphJobRunner.setup(GraphJobRunner.java:113)
	at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:166)
	at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
	at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1097)
12/05/23 19:03:17 INFO zookeeper.ZooKeeper: Session: 0x137792074af000e closed
12/05/23 19:03:17 INFO zookeeper.ClientCnxn: EventThread shut down
12/05/23 19:03:17 ERROR bsp.BSPTask: Shutting down ping service.
12/05/23 19:03:17 FATAL bsp.GroomServer: Error running child
java.io.IOException: org.apache.hadoop.io.Text read 31 bytes, should read 190
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2129)
	at org.apache.hama.bsp.SequenceFileRecordReader.next(SequenceFileRecordReader.java:82)
	at org.apache.hama.bsp.TrackedRecordReader.moveToNext(TrackedRecordReader.java:60)
	at org.apache.hama.bsp.TrackedRecordReader.next(TrackedRecordReader.java:46)
	at org.apache.hama.bsp.BSPPeerImpl.readNext(BSPPeerImpl.java:482)
	at org.apache.hama.graph.GraphJobRunner.loadVertices(GraphJobRunner.java:280)
	at org.apache.hama.graph.GraphJobRunner.setup(GraphJobRunner.java:113)
	at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:166)
	at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
	at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1097)
java.io.IOException: org.apache.hadoop.io.Text read 31 bytes, should read 190
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2129)
	at org.apache.hama.bsp.SequenceFileRecordReader.next(SequenceFileRecordReader.java:82)
	at org.apache.hama.bsp.TrackedRecordReader.moveToNext(TrackedRecordReader.java:60)
	at org.apache.hama.bsp.TrackedRecordReader.next(TrackedRecordReader.java:46)
	at org.apache.hama.bsp.BSPPeerImpl.readNext(BSPPeerImpl.java:482)
	at org.apache.hama.graph.GraphJobRunner.loadVertices(GraphJobRunner.java:280)
	at org.apache.hama.graph.GraphJobRunner.setup(GraphJobRunner.java:113)
	at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:166)
	at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
	at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1097)
{code}
                

[jira] [Commented] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "praveen sripati (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280710#comment-13280710 ] 

praveen sripati commented on HAMA-531:
--------------------------------------

>However we should think about how we build the pre-job partitioner.

Thomas - HAMA-561 has been created for processing already partitioned files.
                

[jira] [Commented] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281504#comment-13281504 ] 

Thomas Jungblut commented on HAMA-531:
--------------------------------------

Hf, however the question is whether the input is broken or the reader. The split is handled by the SequenceFile and not by the filesystem. I'll take a closer look then. Thanks for the observation.
                

[jira] [Commented] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280708#comment-13280708 ] 

Edward J. Yoon commented on HAMA-531:
-------------------------------------

{code}
I take a first shot for the graph algorithms.
I guess we should distinct between pre-job partitioning and runtime partitioning. For graph algorithms we can use runtime partitioning.
For other algorithms this might not be suitable.
{code}

+1
                

[jira] [Commented] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "praveen sripati (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280721#comment-13280721 ] 

praveen sripati commented on HAMA-531:
--------------------------------------

The partitioned data can be optionally written to HDFS so that another BSP (or MR) job can be run without partitioning the data again. HAMA-577 has been created for the same.
                

[jira] [Commented] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281486#comment-13281486 ] 

Edward J. Yoon commented on HAMA-531:
-------------------------------------

It seems there's a bug in partitioning.

{code}
edward@slave:~/workspace/hama-trunk$ bin/hama jar examples/target/hama-examples-0.5.0-incubating-SNAPSHOT.jar sssp 3 /user/edward/data/part-r-00000 output 3
12/05/23 18:46:33 DEBUG bsp.BSPJobClient: BSPJobClient.submitJobDir: hdfs://slave.udanax.org:9001/tmp/hadoop-edward/bsp/system/submit_mue8lf
12/05/23 18:46:33 DEBUG bsp.BSPJobClient: Creating splits at hdfs://slave.udanax.org:9001/tmp/hadoop-edward/bsp/system/submit_mue8lf/job.split
12/05/23 18:46:33 INFO bsp.FileInputFormat: Total input paths to process : 1
12/05/23 18:46:33 DEBUG bsp.FileInputFormat: computeSplitSize: 70724 (70724, 2000, 67108864)
12/05/23 18:46:33 INFO bsp.FileInputFormat: Total # of splits: 3
12/05/23 18:46:33 INFO bsp.BSPJobClient: Running job: job_201205231839_0009
12/05/23 18:46:36 INFO bsp.BSPJobClient: Current supersteps number: 0
12/05/23 18:46:39 INFO bsp.BSPJobClient: Current supersteps number: 12
12/05/23 18:46:39 INFO bsp.BSPJobClient: The total number of supersteps: 12
12/05/23 18:46:39 DEBUG bsp.Counters: Adding SUPERSTEPS
12/05/23 18:46:39 INFO bsp.BSPJobClient: Counters: 10
12/05/23 18:46:39 INFO bsp.BSPJobClient:   org.apache.hama.bsp.JobInProgress$JobCounter
12/05/23 18:46:39 INFO bsp.BSPJobClient:     LAUNCHED_TASKS=3
12/05/23 18:46:39 INFO bsp.BSPJobClient:   org.apache.hama.bsp.BSPPeerImpl$PeerCounter
12/05/23 18:46:39 INFO bsp.BSPJobClient:     SUPERSTEPS=12
12/05/23 18:46:39 INFO bsp.BSPJobClient:     COMPRESSED_BYTES_SENT=27902
12/05/23 18:46:39 INFO bsp.BSPJobClient:     SUPERSTEP_SUM=36
12/05/23 18:46:39 INFO bsp.BSPJobClient:     TIME_IN_SYNC_MS=4369
12/05/23 18:46:39 INFO bsp.BSPJobClient:     IO_BYTES_READ=212069
12/05/23 18:46:39 INFO bsp.BSPJobClient:     COMPRESSED_BYTES_RECEIVED=27902
12/05/23 18:46:39 INFO bsp.BSPJobClient:     TOTAL_MESSAGES_SENT=4374
12/05/23 18:46:39 INFO bsp.BSPJobClient:     TASK_INPUT_RECORDS=100
12/05/23 18:46:39 INFO bsp.BSPJobClient:     TOTAL_MESSAGES_RECEIVED=2187
Job Finished in 6.517 seconds
edward@slave:~/workspace/hama-trunk$ bin/hama jar examples/target/hama-examples-0.5.0-incubating-SNAPSHOT.jar sssp 3 /user/edward/data/part-r-00000 output 4
12/05/23 18:46:44 DEBUG bsp.BSPJobClient: BSPJobClient.submitJobDir: hdfs://slave.udanax.org:9001/tmp/hadoop-edward/bsp/system/submit_a44pqb
12/05/23 18:46:44 DEBUG bsp.BSPJobClient: Creating splits at hdfs://slave.udanax.org:9001/tmp/hadoop-edward/bsp/system/submit_a44pqb/job.split
12/05/23 18:46:44 INFO bsp.FileInputFormat: Total input paths to process : 1
12/05/23 18:46:44 DEBUG bsp.FileInputFormat: computeSplitSize: 53043 (53043, 2000, 67108864)
12/05/23 18:46:44 INFO bsp.FileInputFormat: Total # of splits: 4
12/05/23 18:46:44 INFO bsp.BSPJobClient: Running job: job_201205231839_0010
12/05/23 18:46:47 INFO bsp.BSPJobClient: Current supersteps number: 0
12/05/23 18:46:56 INFO bsp.BSPJobClient: Job failed.
{code}
                

[jira] [Commented] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280134#comment-13280134 ] 

Thomas Jungblut commented on HAMA-531:
--------------------------------------

I'll take a first shot at this for the graph algorithms.
I guess we should distinguish between pre-job partitioning and runtime partitioning. For graph algorithms we can use runtime partitioning.
For other algorithms this might not be suitable.
                

[jira] [Commented] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281498#comment-13281498 ] 

Edward J. Yoon commented on HAMA-531:
-------------------------------------

{quote}If you don't provide task number does it work as well?{quote}

No. I think it works only on a single machine.
                

[jira] [Assigned] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Jungblut reassigned HAMA-531:
------------------------------------

    Assignee: Thomas Jungblut
    

[jira] [Commented] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13279844#comment-13279844 ] 

Thomas Jungblut commented on HAMA-531:
--------------------------------------

Sounds reasonable to schedule a BSP. Should we put this into 0.5.0?
                

[jira] [Commented] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280727#comment-13280727 ] 

Edward J. Yoon commented on HAMA-531:
-------------------------------------

I think there are some issues related to the reopenInput() function, ..., etc.
                

[jira] [Commented] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "Edward J. Yoon (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225870#comment-13225870 ] 

Edward J. Yoon commented on HAMA-531:
-------------------------------------

In my opinion, each task should load its locally assigned data and transfer the data across the network in an optimized way.

                

[jira] [Commented] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281501#comment-13281501 ] 

Edward J. Yoon commented on HAMA-531:
-------------------------------------

Haha, I'm heading out to dinner. cu again.
                

[jira] [Commented] (HAMA-531) Data re-partitioning in BSPJobClient

Posted by "Thomas Jungblut (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225931#comment-13225931 ] 

Thomas Jungblut commented on HAMA-531:
--------------------------------------

That could be an idea.
We could also schedule a MapReduce job to do this partitioning when a cluster is available. Or we could schedule a BSP job to do it, as you said.
                