You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hama.apache.org by "praveen sripati (JIRA)" <ji...@apache.org> on 2012/05/20 18:09:40 UTC
[jira] [Commented] (HAMA-531) Data re-partitioning in BSPJobClient

    [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13279804#comment-13279804 ] 

praveen sripati commented on HAMA-531:
--------------------------------------

I haven't gathered any performance metrics, but partitioning in the BSPJobClient (on the node on which the BSP job is submitted) seems to be not very efficient. Moving the data partitioning from the BSPJobClient to do the processing parallely will cut short the total time for processing drastically. So, I am interested in getting some thought process going on around this JIRA.

In the JIRA two approaches have been mentioned.

1. Using BSP to partition the data.
2. Using MR to partition the data.

Using the MR approach

	- The data has to be read by the mappers (READ)
	- The output of the mapper has to be written the file system (WRITE)
	- Reducers have to read the data back from the file system (READ)
	- Reducers process and write the data back to HDFS (WRITE)
	- The BSP Job reads the MR output (READ) and does the processing

So, there are 3 Reads and 2 Writes, before the data is actually processed by the BSP Job.

Using the BSP Job

	- The data is read by the BSP Task (READ)
	- BSP task checks which task the record belongs to using the partitioner and sends the message to the appropriate task.
	- Global Sync
	- The bsp tasks write data to HDFS (optional WRITE)
	- The various bsp tasks receive the message and start processing immediately.

So, there is only 1 Read.

Partitioning using BSP seems to be much faster when compared to MR. The only advantage I see of the MR approach is that since the partitioned data is written to the disk, the same BSP job can be run multiple times without any partitioning the data again. Of course, the BSP tasks could also write the partitioned data to the HDFS to be processed later if required. I don't see any obvious advantage using the MR approach over BSP approach.

Does anyone know how it is done in Giraph?
                
> Data re-partitioning in BSPJobClient
> ------------------------------------
>
>                 Key: HAMA-531
>                 URL: https://issues.apache.org/jira/browse/HAMA-531
>             Project: Hama
>          Issue Type: Improvement
>            Reporter: Edward J. Yoon
>
> The re-partitioning the data is a very expensive operation. By the way, currently, we processes read/write operations sequentially using HDFS api in BSPJobClient from client-side. This causes potential too many open files error, contains HDFS overheads, and shows slow performance.
> We have to find another way to re-partitioning data.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira