Posted to common-user@hadoop.apache.org by Hadoop Explorer <ha...@outlook.com> on 2013/04/18 13:49:24 UTC

will an application with two maps but no reduce be suitable for hadoop?

I have an application that evaluates a graph using the following algorithm:

- use a parallel for loop to evaluate all nodes in the graph (to evaluate a node, an image is read, and then the result of that node is calculated)

- use a second parallel for loop to evaluate all edges in the graph.  The function would take in the results from both nodes of the edge, and then calculate the answer for the edge


As you can see, the algorithm above would employ two map functions but no reduce function.  The total data size can be very large (say 100 GB).  Also, the workload of each node and each edge is highly irregular, so load-balancing mechanisms are essential.

In this case, will Hadoop suit this application?  If so, what will the architecture of my program look like?  And will Hadoop be able to strike a balance between good load balancing in the second map function and minimizing data transfer of the results from the first map function?
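
For concreteness, here is a rough sketch of the driver I have in mind, written against the Hadoop mapreduce API (GraphDriver, NodeMapper, and EdgeMapper are placeholders for my own code; a map-only job just sets the reducer count to zero):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class GraphDriver {

      // Placeholder mappers: the real node/edge evaluation logic goes here.
      public static class NodeMapper
          extends Mapper<LongWritable, Text, LongWritable, Text> { }

      public static class EdgeMapper
          extends Mapper<LongWritable, Text, LongWritable, Text> { }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Phase 1: map over the node list; zero reducers makes the job
        // map-only, so each mapper's output goes straight to HDFS.
        Job nodeJob = Job.getInstance(conf, "evaluate-nodes");
        nodeJob.setJarByClass(GraphDriver.class);
        nodeJob.setMapperClass(NodeMapper.class);
        nodeJob.setNumReduceTasks(0);                  // map-only
        FileInputFormat.addInputPath(nodeJob, new Path(args[0]));
        FileOutputFormat.setOutputPath(nodeJob, new Path(args[1]));
        if (!nodeJob.waitForCompletion(true)) System.exit(1);

        // Phase 2: map over the edge list; each task must fetch the two
        // node results it needs from phase 1's output on HDFS, which is
        // where the data-transfer question comes in.
        Job edgeJob = Job.getInstance(conf, "evaluate-edges");
        edgeJob.setJarByClass(GraphDriver.class);
        edgeJob.setMapperClass(EdgeMapper.class);
        edgeJob.setNumReduceTasks(0);                  // map-only
        FileInputFormat.addInputPath(edgeJob, new Path(args[2]));
        FileOutputFormat.setOutputPath(edgeJob, new Path(args[3]));
        System.exit(edgeJob.waitForCompletion(true) ? 0 : 1);
      }
    }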



Re: will an application with two maps but no reduce be suitable for hadoop?

Posted by Roman Shaposhnik <rv...@apache.org>.
On Thu, Apr 18, 2013 at 4:49 AM, Hadoop Explorer
<ha...@outlook.com> wrote:
> I have an application that evaluates a graph using the following algorithm:
>
> - use a parallel for loop to evaluate all nodes in the graph (to evaluate a
> node, an image is read, and then the result of that node is calculated)
>
> - use a second parallel for loop to evaluate all edges in the graph.  The
> function would take in the results from both nodes of the edge, and then
> calculate the answer for the edge
>
>
> As you can see, the algorithm above would employ two map functions but no
> reduce function.  The total data size can be very large (say 100 GB).
> Also, the workload of each node and each edge is highly irregular, so
> load-balancing mechanisms are essential.
>
> In this case, will Hadoop suit this application?  If so, what will the
> architecture of my program look like?  And will Hadoop be able to strike a
> balance between good load balancing in the second map function and
> minimizing data transfer of the results from the first map function?

Map-only jobs are well known in the Hadoop ecosystem. For example, that's
how Giraph implements BSP (bulk synchronous parallel processing) on top of
Hadoop. In fact, from what you're describing, it sounds like Giraph could
be a good fit. Check it out:
    http://giraph.apache.org/
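
If you end up staying with plain MapReduce instead, a map-only job is
simply one with zero reduce tasks. Here's a rough sketch (the class name
is a placeholder; NLineInputFormat is just one way to keep splits small,
so that your highly irregular per-record work spreads more evenly across
tasks):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

    public class MapOnlySetup {
      // Build a map-only job with fine-grained input splits: each task
      // handles only a few records, so one expensive node or edge delays
      // less work. The trade-off is extra task-launch overhead.
      static Job newMapOnlyJob(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "evaluate-nodes");
        job.setNumReduceTasks(0);                        // no reduce phase
        job.setInputFormatClass(NLineInputFormat.class); // split per N lines
        NLineInputFormat.setNumLinesPerSplit(job, 10);   // small N = finer balance
        return job;
      }
    }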

Thanks,
Roman.
