You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Ryan Compton <co...@gmail.com> on 2014/04/24 03:20:11 UTC

GraphX: Help understanding the limitations of Pregel

I'm trying shoehorn a label propagation-ish algorithm into GraphX. I
need to update each vertex with the median value of their neighbors.
Unlike PageRank, which updates each vertex with the mean of their
neighbors, I don't have a simple commutative and associative function
to use for mergeMsg.

What are my options? It looks like I can choose between:

1. a hacky mergeMsg (i.e. combine a,b -> Array(a,b) and then do the
median in vprog)
2. collectNeighbors and then median
3. ignore GraphX and just do the whole thing with joins (which I
actually got working, but its slow)

Is there another possibility that I'm missing?

Re: GraphX: Help understanding the limitations of Pregel

Posted by Ankur Dave <an...@gmail.com>.
If you need access to all message values in vprog, there's nothing wrong
with building up an array in mergeMsg (option #1). This is what
org.apache.spark.graphx.lib.TriangleCount does, though with sets instead of
arrays. There will be a performance penalty because of the communication,
but it sounds like that's unavoidable here.

Ankur <http://www.ankurdave.com/>

On Wed, Apr 23, 2014 at 8:20 PM, Ryan Compton <compton.ryan@gmail.com
> wrote:

> 1. a hacky mergeMsg (i.e. combine a,b -> Array(a,b) and then do the
> median in vprog)
>

Re: GraphX: Help understanding the limitations of Pregel

Posted by Ryan Compton <co...@gmail.com>.
Whoops, I should have mentioned that it's a multivariate median (cf
http://www.pnas.org/content/97/4/1423.full.pdf ). It's easy to compute
when all the values are accessible at once. I'm not sure it's possible
with a combiner. So, I guess the question should be: "Can I use
GraphX's Pregel without a combiner?"

On Wed, Apr 23, 2014 at 7:01 PM, Tom Vacek <mi...@gmail.com> wrote:
> Here are some out-of-the-box ideas:  If the elements lie in a fairly small
> range and/or you're willing to work with limited precision, you could use
> counting sort.  Moreover, you could iteratively find the median using
> bisection, which would be associative and commutative.  It's easy to think
> of improvements that would make this approach give a reasonable answer in a
> few iterations.  I have no idea about mixing algorithmic iterations with
> median-finding iterations.
>
>
> On Wed, Apr 23, 2014 at 8:20 PM, Ryan Compton <co...@gmail.com>
> wrote:
>>
>> I'm trying shoehorn a label propagation-ish algorithm into GraphX. I
>> need to update each vertex with the median value of their neighbors.
>> Unlike PageRank, which updates each vertex with the mean of their
>> neighbors, I don't have a simple commutative and associative function
>> to use for mergeMsg.
>>
>> What are my options? It looks like I can choose between:
>>
>> 1. a hacky mergeMsg (i.e. combine a,b -> Array(a,b) and then do the
>> median in vprog)
>> 2. collectNeighbors and then median
>> 3. ignore GraphX and just do the whole thing with joins (which I
>> actually got working, but its slow)
>>
>> Is there another possibility that I'm missing?
>
>

Re: GraphX: Help understanding the limitations of Pregel

Posted by Tom Vacek <mi...@gmail.com>.
Here are some out-of-the-box ideas:  If the elements lie in a fairly small
range and/or you're willing to work with limited precision, you could use
counting sort.  Moreover, you could iteratively find the median using
bisection, which would be associative and commutative.  It's easy to think
of improvements that would make this approach give a reasonable answer in a
few iterations.  I have no idea about mixing algorithmic iterations with
median-finding iterations.


On Wed, Apr 23, 2014 at 8:20 PM, Ryan Compton <co...@gmail.com>wrote:

> I'm trying shoehorn a label propagation-ish algorithm into GraphX. I
> need to update each vertex with the median value of their neighbors.
> Unlike PageRank, which updates each vertex with the mean of their
> neighbors, I don't have a simple commutative and associative function
> to use for mergeMsg.
>
> What are my options? It looks like I can choose between:
>
> 1. a hacky mergeMsg (i.e. combine a,b -> Array(a,b) and then do the
> median in vprog)
> 2. collectNeighbors and then median
> 3. ignore GraphX and just do the whole thing with joins (which I
> actually got working, but its slow)
>
> Is there another possibility that I'm missing?
>