You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@giraph.apache.org by David Koch <og...@googlemail.com> on 2013/01/28 14:52:48 UTC

Multiple node types in Giraph and doing a selective M/R over one of them

Hello,

In Giraph is it possible to have different node types in a graph and have a
Map/Reduce only iterate over nodes of this type and their direct successors?

If it sounds a bit cryptic here is something more about our use-case:
We have different HBase tables which we want to "pseudo-join" in Map/Reduce
computations. The node types I mentioned above correspond to the respective
row-key types used in each of those tables, edges are generated by the fact
that the KeyValues in each table can contain row-key values found in one of
the other tables.

The graph would describe these relations. In a Map/Reduce I then want to be
able to iterate over all nodes of a given type while also having access to
a node's successor nodes in the same Mapper instance or better yet the same
map() call. One would then carry out additional Gets to retrieve the data
from the tables thus doing a fairly crude join.

The Graph is likely to change so it would be nice if it could be updated
incrementally.

Does all this sound like something that would be possible with Giraph?

Thank you,

/David

Re: Multiple node types in Giraph and doing a selective M/R over one of them

Posted by Eli Reisman <ap...@gmail.com>.

Right. If your use case boils down to a join, we're probably not the ideal
tool.


On Tue, Jan 29, 2013 at 1:26 AM, Claudio Martella <
claudio.martella@gmail.com> wrote:

> That is correct, but that is not the reason. We use M/R for resource
> allocation, but we do not inherit the limits of the M/R paradigm for
> graphs. The thing is that you probably have a lot of data (I assume it
> because you are using HBase, hence it is difficult for you to fit it all
> into memory), and you do not have an iterative process (hence you do not
> hit the costs of multiple M/R jobs). Plus, the join is a well understood
> problem in M/R.
>
>
> On Tue, Jan 29, 2013 at 9:52 AM, David Koch <og...@googlemail.com> wrote:
>
>> Hello Claudio and Eli,
>>
>> Thank you for your answers. As far as Map/Reduce being a better tool for
>> the job - I was under the impression that Giraph relies on the M/R
>> framework. It seems like it when I check the console output of the examples
>> on the project's Wiki.
>>
>> Again, thank you.
>>
>> /David
>>
>>
>> On Mon, Jan 28, 2013 at 8:49 PM, Claudio Martella <
>> claudio.martella@gmail.com> wrote:
>>
>>> One more general point would be whether giraph is a better tool for your
>>> problem. From my understanding, map reduce is probably the way to go.
>>>
>>>
>>> On Monday, January 28, 2013, Eli Reisman wrote:
>>>
>>>> I agree, something like this is possible using the vertex value. In
>>>> giraph, we now have native support for multigraphs, but before we had that
>>>> support, I described a kind of "cheat" to process multigraphs. You could
>>>> use a variation of that same cheat (its on the site confluence wiki) to do
>>>> what you're talking about I think, even though you're not dealing with a
>>>> multigraph in the problem you described. Essentially, you can get clever
>>>> about what sort of Writable you use for the vertex value type, and/or what
>>>> the values it holds can represent in your dataset.
>>>>
>>>> Alternately, in the off chance that the row-keys do not repeat in the
>>>> tables, then really the "row key" can be a Writable vertex ID as long as
>>>> each is unique .The only repetition would be the fact that other rows with
>>>> their own unique row-keys contain row values that mark out-edges to other
>>>> unique row-keys in the table, but more than once since any row-key could
>>>> have lots of other rows "pointing" an out-edge value towards it. Thinking
>>>> of each row key as unique vertex ID then just turns this into a vanilla
>>>> graph. However, if the row keys are not unique in among all your tables,
>>>> this oversimplifies the problem and you really are stuck wtih the above
>>>> vertex value option.
>>>>
>>>> My point: Giraph has vertex value, ID, out-edge-to-other-vertex ID's,
>>>> and message data types, and as long as the properties required of each for
>>>> a graph are met, and each is a Writable, you can think of the problem
>>>> (often) in one of several ways that Giraph can support.
>>>>
>>>> One last thought: assuming the graph does not mutate during processing,
>>>> you could also write a custom input format that evaluates each row as it
>>>> builds it into a graph vertex data structure, and chooses only row keys
>>>> that are of a certain classification in your use case to make into graph
>>>> data for that job run, simply skipping the other rows as it reads them.
>>>> again, this "solution" depends on the nature of your problem. Just
>>>> something to play with.
>>>>
>>>> Good luck with your use case!
>>>>
>>>> On Mon, Jan 28, 2013 at 7:09 AM, Claudio Martella <
>>>> claudio.martella@gmail.com> wrote:
>>>>
>>>>> Giraph does not support multipartite graph in a natural way. But you
>>>>> can try to model your different sets through the vertexvalue. You can then
>>>>> propagate it (by composing with the ID?) to the neighbors, and obtain your
>>>>> join.
>>>>>
>>>>>
>>>>> On Mon, Jan 28, 2013 at 2:52 PM, David Koch <og...@googlemail.com>wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> In Giraph is it possible to have different node types in a graph and
>>>>>> have a Map/Reduce only iterate over nodes of this type and their direct
>>>>>> successors?
>>>>>>
>>>>>> If it sounds a bit cryptic here is something more about our use-case:
>>>>>> We have different HBase tables which we want to "pseudo-join" in
>>>>>> Map/Reduce computations. The node types I mentioned above correspond to the
>>>>>> respective row-key types used in each of those tables, edges are generated
>>>>>> by the fact that the KeyValues in each table can contain row-key values
>>>>>> found in one of the other tables.
>>>>>>
>>>>>> The graph would describe these relations. In a Map/Reduce I then want
>>>>>> to be able to iterate over all nodes of a given type while also having
>>>>>> access to a node's successor nodes in the same Mapper instance or better
>>>>>> yet the same map() call. One would then carry out additional Gets to
>>>>>> retrieve the data from the tables thus doing a fairly crude join.
>>>>>>
>>>>>> The Graph is likely to change so it would be nice if it could be
>>>>>> updated incrementally.
>>>>>>
>>>>>> Does all this sound like something that would be possible with Giraph?
>>>>>>
>>>>>> Thank you,
>>>>>>
>>>>>> /David
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>    Claudio Martella
>>>>>    claudio.martella@gmail.com
>>>>>
>>>>
>>>>
>>>
>>> --
>>>    Claudio Martella
>>>    claudio.martella@gmail.com
>>>
>>
>>
>
>
> --
>    Claudio Martella
>    claudio.martella@gmail.com
>

Re: Multiple node types in Giraph and doing a selective M/R over one of them

Posted by Claudio Martella <cl...@gmail.com>.

That is correct, but that is not the reason. We use M/R for resource
allocation, but we do not inherit the limits of the M/R paradigm for
graphs. The thing is that you probably have a lot of data (I assume it
because you are using HBase, hence it is difficult for you to fit it all
into memory), and you do not have an iterative process (hence you do not
hit the costs of multiple M/R jobs). Plus, the join is a well understood
problem in M/R.


On Tue, Jan 29, 2013 at 9:52 AM, David Koch <og...@googlemail.com> wrote:

> Hello Claudio and Eli,
>
> Thank you for your answers. As far as Map/Reduce being a better tool for
> the job - I was under the impression that Giraph relies on the M/R
> framework. It seems like it when I check the console output of the examples
> on the project's Wiki.
>
> Again, thank you.
>
> /David
>
>
> On Mon, Jan 28, 2013 at 8:49 PM, Claudio Martella <
> claudio.martella@gmail.com> wrote:
>
>> One more general point would be whether giraph is a better tool for your
>> problem. From my understanding, map reduce is probably the way to go.
>>
>>
>> On Monday, January 28, 2013, Eli Reisman wrote:
>>
>>> I agree, something like this is possible using the vertex value. In
>>> giraph, we now have native support for multigraphs, but before we had that
>>> support, I described a kind of "cheat" to process multigraphs. You could
>>> use a variation of that same cheat (its on the site confluence wiki) to do
>>> what you're talking about I think, even though you're not dealing with a
>>> multigraph in the problem you described. Essentially, you can get clever
>>> about what sort of Writable you use for the vertex value type, and/or what
>>> the values it holds can represent in your dataset.
>>>
>>> Alternately, in the off chance that the row-keys do not repeat in the
>>> tables, then really the "row key" can be a Writable vertex ID as long as
>>> each is unique .The only repetition would be the fact that other rows with
>>> their own unique row-keys contain row values that mark out-edges to other
>>> unique row-keys in the table, but more than once since any row-key could
>>> have lots of other rows "pointing" an out-edge value towards it. Thinking
>>> of each row key as unique vertex ID then just turns this into a vanilla
>>> graph. However, if the row keys are not unique in among all your tables,
>>> this oversimplifies the problem and you really are stuck wtih the above
>>> vertex value option.
>>>
>>> My point: Giraph has vertex value, ID, out-edge-to-other-vertex ID's,
>>> and message data types, and as long as the properties required of each for
>>> a graph are met, and each is a Writable, you can think of the problem
>>> (often) in one of several ways that Giraph can support.
>>>
>>> One last thought: assuming the graph does not mutate during processing,
>>> you could also write a custom input format that evaluates each row as it
>>> builds it into a graph vertex data structure, and chooses only row keys
>>> that are of a certain classification in your use case to make into graph
>>> data for that job run, simply skipping the other rows as it reads them.
>>> again, this "solution" depends on the nature of your problem. Just
>>> something to play with.
>>>
>>> Good luck with your use case!
>>>
>>> On Mon, Jan 28, 2013 at 7:09 AM, Claudio Martella <
>>> claudio.martella@gmail.com> wrote:
>>>
>>>> Giraph does not support multipartite graph in a natural way. But you
>>>> can try to model your different sets through the vertexvalue. You can then
>>>> propagate it (by composing with the ID?) to the neighbors, and obtain your
>>>> join.
>>>>
>>>>
>>>> On Mon, Jan 28, 2013 at 2:52 PM, David Koch <og...@googlemail.com>wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> In Giraph is it possible to have different node types in a graph and
>>>>> have a Map/Reduce only iterate over nodes of this type and their direct
>>>>> successors?
>>>>>
>>>>> If it sounds a bit cryptic here is something more about our use-case:
>>>>> We have different HBase tables which we want to "pseudo-join" in
>>>>> Map/Reduce computations. The node types I mentioned above correspond to the
>>>>> respective row-key types used in each of those tables, edges are generated
>>>>> by the fact that the KeyValues in each table can contain row-key values
>>>>> found in one of the other tables.
>>>>>
>>>>> The graph would describe these relations. In a Map/Reduce I then want
>>>>> to be able to iterate over all nodes of a given type while also having
>>>>> access to a node's successor nodes in the same Mapper instance or better
>>>>> yet the same map() call. One would then carry out additional Gets to
>>>>> retrieve the data from the tables thus doing a fairly crude join.
>>>>>
>>>>> The Graph is likely to change so it would be nice if it could be
>>>>> updated incrementally.
>>>>>
>>>>> Does all this sound like something that would be possible with Giraph?
>>>>>
>>>>> Thank you,
>>>>>
>>>>> /David
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>    Claudio Martella
>>>>    claudio.martella@gmail.com
>>>>
>>>
>>>
>>
>> --
>>    Claudio Martella
>>    claudio.martella@gmail.com
>>
>
>


-- 
   Claudio Martella
   claudio.martella@gmail.com

Re: Multiple node types in Giraph and doing a selective M/R over one of them

Posted by David Koch <og...@googlemail.com>.

Hello Claudio and Eli,

Thank you for your answers. As far as Map/Reduce being a better tool for
the job - I was under the impression that Giraph relies on the M/R
framework. It seems like it when I check the console output of the examples
on the project's Wiki.

Again, thank you.

/David

On Mon, Jan 28, 2013 at 8:49 PM, Claudio Martella <
claudio.martella@gmail.com> wrote:

> One more general point would be whether giraph is a better tool for your
> problem. From my understanding, map reduce is probably the way to go.
>
>
> On Monday, January 28, 2013, Eli Reisman wrote:
>
>> I agree, something like this is possible using the vertex value. In
>> giraph, we now have native support for multigraphs, but before we had that
>> support, I described a kind of "cheat" to process multigraphs. You could
>> use a variation of that same cheat (its on the site confluence wiki) to do
>> what you're talking about I think, even though you're not dealing with a
>> multigraph in the problem you described. Essentially, you can get clever
>> about what sort of Writable you use for the vertex value type, and/or what
>> the values it holds can represent in your dataset.
>>
>> Alternately, in the off chance that the row-keys do not repeat in the
>> tables, then really the "row key" can be a Writable vertex ID as long as
>> each is unique .The only repetition would be the fact that other rows with
>> their own unique row-keys contain row values that mark out-edges to other
>> unique row-keys in the table, but more than once since any row-key could
>> have lots of other rows "pointing" an out-edge value towards it. Thinking
>> of each row key as unique vertex ID then just turns this into a vanilla
>> graph. However, if the row keys are not unique in among all your tables,
>> this oversimplifies the problem and you really are stuck wtih the above
>> vertex value option.
>>
>> My point: Giraph has vertex value, ID, out-edge-to-other-vertex ID's, and
>> message data types, and as long as the properties required of each for a
>> graph are met, and each is a Writable, you can think of the problem (often)
>> in one of several ways that Giraph can support.
>>
>> One last thought: assuming the graph does not mutate during processing,
>> you could also write a custom input format that evaluates each row as it
>> builds it into a graph vertex data structure, and chooses only row keys
>> that are of a certain classification in your use case to make into graph
>> data for that job run, simply skipping the other rows as it reads them.
>> again, this "solution" depends on the nature of your problem. Just
>> something to play with.
>>
>> Good luck with your use case!
>>
>> On Mon, Jan 28, 2013 at 7:09 AM, Claudio Martella <
>> claudio.martella@gmail.com> wrote:
>>
>>> Giraph does not support multipartite graph in a natural way. But you can
>>> try to model your different sets through the vertexvalue. You can then
>>> propagate it (by composing with the ID?) to the neighbors, and obtain your
>>> join.
>>>
>>>
>>> On Mon, Jan 28, 2013 at 2:52 PM, David Koch <og...@googlemail.com>wrote:
>>>
>>>> Hello,
>>>>
>>>> In Giraph is it possible to have different node types in a graph and
>>>> have a Map/Reduce only iterate over nodes of this type and their direct
>>>> successors?
>>>>
>>>> If it sounds a bit cryptic here is something more about our use-case:
>>>> We have different HBase tables which we want to "pseudo-join" in
>>>> Map/Reduce computations. The node types I mentioned above correspond to the
>>>> respective row-key types used in each of those tables, edges are generated
>>>> by the fact that the KeyValues in each table can contain row-key values
>>>> found in one of the other tables.
>>>>
>>>> The graph would describe these relations. In a Map/Reduce I then want
>>>> to be able to iterate over all nodes of a given type while also having
>>>> access to a node's successor nodes in the same Mapper instance or better
>>>> yet the same map() call. One would then carry out additional Gets to
>>>> retrieve the data from the tables thus doing a fairly crude join.
>>>>
>>>> The Graph is likely to change so it would be nice if it could be
>>>> updated incrementally.
>>>>
>>>> Does all this sound like something that would be possible with Giraph?
>>>>
>>>> Thank you,
>>>>
>>>> /David
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>>    Claudio Martella
>>>    claudio.martella@gmail.com
>>>
>>
>>
>
> --
>    Claudio Martella
>    claudio.martella@gmail.com
>

Re: Multiple node types in Giraph and doing a selective M/R over one of them

Posted by Claudio Martella <cl...@gmail.com>.

One more general point would be whether giraph is a better tool for your
problem. From my understanding, map reduce is probably the way to go.

On Monday, January 28, 2013, Eli Reisman wrote:

> I agree, something like this is possible using the vertex value. In
> giraph, we now have native support for multigraphs, but before we had that
> support, I described a kind of "cheat" to process multigraphs. You could
> use a variation of that same cheat (its on the site confluence wiki) to do
> what you're talking about I think, even though you're not dealing with a
> multigraph in the problem you described. Essentially, you can get clever
> about what sort of Writable you use for the vertex value type, and/or what
> the values it holds can represent in your dataset.
>
> Alternately, in the off chance that the row-keys do not repeat in the
> tables, then really the "row key" can be a Writable vertex ID as long as
> each is unique .The only repetition would be the fact that other rows with
> their own unique row-keys contain row values that mark out-edges to other
> unique row-keys in the table, but more than once since any row-key could
> have lots of other rows "pointing" an out-edge value towards it. Thinking
> of each row key as unique vertex ID then just turns this into a vanilla
> graph. However, if the row keys are not unique in among all your tables,
> this oversimplifies the problem and you really are stuck wtih the above
> vertex value option.
>
> My point: Giraph has vertex value, ID, out-edge-to-other-vertex ID's, and
> message data types, and as long as the properties required of each for a
> graph are met, and each is a Writable, you can think of the problem (often)
> in one of several ways that Giraph can support.
>
> One last thought: assuming the graph does not mutate during processing,
> you could also write a custom input format that evaluates each row as it
> builds it into a graph vertex data structure, and chooses only row keys
> that are of a certain classification in your use case to make into graph
> data for that job run, simply skipping the other rows as it reads them.
> again, this "solution" depends on the nature of your problem. Just
> something to play with.
>
> Good luck with your use case!
>
> On Mon, Jan 28, 2013 at 7:09 AM, Claudio Martella <
> claudio.martella@gmail.com <javascript:_e({}, 'cvml',
> 'claudio.martella@gmail.com');>> wrote:
>
>> Giraph does not support multipartite graph in a natural way. But you can
>> try to model your different sets through the vertexvalue. You can then
>> propagate it (by composing with the ID?) to the neighbors, and obtain your
>> join.
>>
>>
>> On Mon, Jan 28, 2013 at 2:52 PM, David Koch <ogdude@googlemail.com<javascript:_e({}, 'cvml', 'ogdude@googlemail.com');>
>> > wrote:
>>
>>> Hello,
>>>
>>> In Giraph is it possible to have different node types in a graph and
>>> have a Map/Reduce only iterate over nodes of this type and their direct
>>> successors?
>>>
>>> If it sounds a bit cryptic here is something more about our use-case:
>>> We have different HBase tables which we want to "pseudo-join" in
>>> Map/Reduce computations. The node types I mentioned above correspond to the
>>> respective row-key types used in each of those tables, edges are generated
>>> by the fact that the KeyValues in each table can contain row-key values
>>> found in one of the other tables.
>>>
>>> The graph would describe these relations. In a Map/Reduce I then want to
>>> be able to iterate over all nodes of a given type while also having access
>>> to a node's successor nodes in the same Mapper instance or better yet the
>>> same map() call. One would then carry out additional Gets to retrieve the
>>> data from the tables thus doing a fairly crude join.
>>>
>>> The Graph is likely to change so it would be nice if it could be updated
>>> incrementally.
>>>
>>> Does all this sound like something that would be possible with Giraph?
>>>
>>> Thank you,
>>>
>>> /David
>>>
>>>
>>>
>>>
>>
>>
>> --
>>    Claudio Martella
>>    claudio.martella@gmail.com <javascript:_e({}, 'cvml',
>> 'claudio.martella@gmail.com');>
>>
>
>

-- 
   Claudio Martella
   claudio.martella@gmail.com

Re: Multiple node types in Giraph and doing a selective M/R over one of them

Posted by Eli Reisman <ap...@gmail.com>.

I agree, something like this is possible using the vertex value. In giraph,
we now have native support for multigraphs, but before we had that support,
I described a kind of "cheat" to process multigraphs. You could use a
variation of that same cheat (its on the site confluence wiki) to do what
you're talking about I think, even though you're not dealing with a
multigraph in the problem you described. Essentially, you can get clever
about what sort of Writable you use for the vertex value type, and/or what
the values it holds can represent in your dataset.

Alternately, in the off chance that the row-keys do not repeat in the
tables, then really the "row key" can be a Writable vertex ID as long as
each is unique .The only repetition would be the fact that other rows with
their own unique row-keys contain row values that mark out-edges to other
unique row-keys in the table, but more than once since any row-key could
have lots of other rows "pointing" an out-edge value towards it. Thinking
of each row key as unique vertex ID then just turns this into a vanilla
graph. However, if the row keys are not unique in among all your tables,
this oversimplifies the problem and you really are stuck wtih the above
vertex value option.

My point: Giraph has vertex value, ID, out-edge-to-other-vertex ID's, and
message data types, and as long as the properties required of each for a
graph are met, and each is a Writable, you can think of the problem (often)
in one of several ways that Giraph can support.

One last thought: assuming the graph does not mutate during processing, you
could also write a custom input format that evaluates each row as it builds
it into a graph vertex data structure, and chooses only row keys that are
of a certain classification in your use case to make into graph data for
that job run, simply skipping the other rows as it reads them. again, this
"solution" depends on the nature of your problem. Just something to play
with.

Good luck with your use case!

On Mon, Jan 28, 2013 at 7:09 AM, Claudio Martella <
claudio.martella@gmail.com> wrote:

> Giraph does not support multipartite graph in a natural way. But you can
> try to model your different sets through the vertexvalue. You can then
> propagate it (by composing with the ID?) to the neighbors, and obtain your
> join.
>
>
> On Mon, Jan 28, 2013 at 2:52 PM, David Koch <og...@googlemail.com> wrote:
>
>> Hello,
>>
>> In Giraph is it possible to have different node types in a graph and have
>> a Map/Reduce only iterate over nodes of this type and their direct
>> successors?
>>
>> If it sounds a bit cryptic here is something more about our use-case:
>> We have different HBase tables which we want to "pseudo-join" in
>> Map/Reduce computations. The node types I mentioned above correspond to the
>> respective row-key types used in each of those tables, edges are generated
>> by the fact that the KeyValues in each table can contain row-key values
>> found in one of the other tables.
>>
>> The graph would describe these relations. In a Map/Reduce I then want to
>> be able to iterate over all nodes of a given type while also having access
>> to a node's successor nodes in the same Mapper instance or better yet the
>> same map() call. One would then carry out additional Gets to retrieve the
>> data from the tables thus doing a fairly crude join.
>>
>> The Graph is likely to change so it would be nice if it could be updated
>> incrementally.
>>
>> Does all this sound like something that would be possible with Giraph?
>>
>> Thank you,
>>
>> /David
>>
>>
>>
>>
>
>
> --
>    Claudio Martella
>    claudio.martella@gmail.com
>

Re: Multiple node types in Giraph and doing a selective M/R over one of them

Posted by Claudio Martella <cl...@gmail.com>.

Giraph does not support multipartite graph in a natural way. But you can
try to model your different sets through the vertexvalue. You can then
propagate it (by composing with the ID?) to the neighbors, and obtain your
join.


On Mon, Jan 28, 2013 at 2:52 PM, David Koch <og...@googlemail.com> wrote:

> Hello,
>
> In Giraph is it possible to have different node types in a graph and have
> a Map/Reduce only iterate over nodes of this type and their direct
> successors?
>
> If it sounds a bit cryptic here is something more about our use-case:
> We have different HBase tables which we want to "pseudo-join" in
> Map/Reduce computations. The node types I mentioned above correspond to the
> respective row-key types used in each of those tables, edges are generated
> by the fact that the KeyValues in each table can contain row-key values
> found in one of the other tables.
>
> The graph would describe these relations. In a Map/Reduce I then want to
> be able to iterate over all nodes of a given type while also having access
> to a node's successor nodes in the same Mapper instance or better yet the
> same map() call. One would then carry out additional Gets to retrieve the
> data from the tables thus doing a fairly crude join.
>
> The Graph is likely to change so it would be nice if it could be updated
> incrementally.
>
> Does all this sound like something that would be possible with Giraph?
>
> Thank you,
>
> /David
>
>
>
>


-- 
   Claudio Martella
   claudio.martella@gmail.com