Posted to user@spark.apache.org by Guillermo Ortiz <ko...@gmail.com> on 2016/02/24 23:26:16 UTC

How could I do this algorithm in Spark?

I want to implement an algorithm in Spark. I know how to do it on a
single machine where all the data is together, but I don't know a good
way to do it in Spark.

If someone has an idea... I have some data like this:
a , b
x , y
b , c
y , y
c , d

I want something like:
a , d
b , d
c , d
x , y
y , y

I need to know that a->b->c->d, so a->d, b->d and c->d.
I don't want the code, just an idea how I could deal with it.

Any idea?

Re: How could I do this algorithm in Spark?

Posted by Guillermo Ortiz <ko...@gmail.com>.
Thank you! I'm trying to do it with Pregel; it's proving hard because I
have never used GraphX or Pregel before.


Re: How could I do this algorithm in Spark?

Posted by Sabarish Sasidharan <sa...@gmail.com>.
Like Robin said, please explore Pregel. You could do it without Pregel,
but it would be laborious. I have a simple outline below, with a code
sketch of the first round after it. You will need more iterations if the
number of levels is higher.

a-b
b-c
c-d
b-e
e-f
f-c

flatMapToPair

a -> (a-b)
b -> (a-b)
b -> (b-c)
c -> (b-c)
c -> (c-d)
d -> (c-d)
b -> (b-e)
e -> (b-e)
e -> (e-f)
f -> (e-f)
f -> (f-c)
c -> (f-c)

aggregateByKey

a -> (a-b)
b -> (a-b, b-c, b-e)
c -> (b-c, c-d, f-c)
d -> (c-d)
e -> (b-e, e-f)
f -> (e-f, f-c)

filter to remove keys with fewer than 2 values

b -> (a-b, b-c, b-e)
c -> (b-c, c-d, f-c)
e -> (b-e, e-f)
f -> (e-f, f-c)

flatMap

a-b-c
a-b-e
b-c-d
b-e-f
e-f-c

flatMapToPair followed by aggregateByKey

(a-b) -> (a-b-c, a-b-e)
(b-c) -> (a-b-c, b-c-d)
(c-d) -> (b-c-d)
(b-e) -> (b-e-f)
(e-f) -> (b-e-f, e-f-c)
(f-c) -> (e-f-c)

filter out keys with fewer than 2 values

(b-c) -> (a-b-c, b-c-d)
(e-f) -> (b-e-f, e-f-c)

mapValues

a-b-c-d
b-e-f-c

flatMap

a,d
b,d
c,d
b,c
e,c
f,c
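
A minimal Scala sketch of the first round of that outline, assuming the
input is an RDD of (src, dst) string pairs; the object name is
illustrative, and groupByKey stands in for the aggregateByKey step to
keep it short:

import org.apache.spark.{SparkConf, SparkContext}

object PathJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("path-join").setMaster("local[*]"))

    // the example edges from the outline
    val edges = sc.parallelize(Seq(
      ("a", "b"), ("b", "c"), ("c", "d"), ("b", "e"), ("e", "f"), ("f", "c")))

    // flatMapToPair step: key every edge by both of its endpoints
    val byVertex = edges.flatMap { case (u, v) => Seq(u -> (u, v), v -> (u, v)) }

    // aggregateByKey step: collect the edges touching each vertex
    val grouped = byVertex.groupByKey()

    // filter step (keys with at least 2 values) plus flatMap step:
    // join each edge arriving at k with each edge leaving k into a 3-path
    val threePaths = grouped
      .filter { case (_, es) => es.size >= 2 }
      .flatMap { case (k, es) =>
        for {
          (x, k1) <- es if k1 == k // edge arriving at k
          (k2, y) <- es if k2 == k // edge leaving k
        } yield (x, k, y)
      }

    threePaths.collect().foreach(println) // (a,b,c), (a,b,e), (b,c,d), ...
    sc.stop()
  }
}

Note that this join also produces f-c-d at vertex c, which the outline
above happens to omit; the second round repeats the same pattern with the
3-paths keyed by each of their constituent edges.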



Re: How could I do this algorithm in Spark?

Posted by Guillermo Ortiz <ko...@gmail.com>.
I'm taking a look at Pregel. It seems like a good way to do it. The only
drawback I see is that mine is not a really complex graph with a lot of
edges between the vertices; it is more like a lot of small isolated
graphs.


Re: How could I do this algorithm in Spark?

Posted by Robin East <ro...@xense.co.uk>.
The structures you are describing look like edges of a graph and you want to follow the graph to a terminal vertex and then propagate that value back up the path. On this assumption it would be simple to create the structures as graphs in GraphX and use Pregel for the algorithm implementation.
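
A rough Scala sketch of that Pregel idea, assuming vertex labels are
encoded as numeric ids (GraphX requires Long vertex ids; here a=1, b=2,
c=3, d=4) and that every chain ends at a vertex with no outgoing edge (a
self-loop like "y,y" would need an extra termination check):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

object TerminalSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("terminal").setMaster("local[*]"))

    // the chain a->b->c->d, with a=1L, b=2L, c=3L, d=4L
    val edges = sc.parallelize(Seq(Edge(1L, 2L, ()), Edge(2L, 3L, ()), Edge(3L, 4L, ())))
    val graph = Graph.fromEdges(edges, defaultValue = -1L) // -1 = terminal not yet known

    // vertices with no outgoing edge are their own terminal
    val init = graph.outerJoinVertices(graph.outDegrees) { (id, _, deg) =>
      if (deg.isEmpty) id else -1L
    }

    val result = init.pregel(initialMsg = -1L)(
      // adopt the terminal id carried by an incoming message
      vprog = (_, attr, msg) => if (msg != -1L) msg else attr,
      // push a known terminal one hop back up the chain
      sendMsg = t =>
        if (t.dstAttr != -1L && t.srcAttr == -1L) Iterator((t.srcId, t.dstAttr))
        else Iterator.empty,
      // a chain delivers at most one message per vertex, so any merge works
      mergeMsg = (a, _) => a)

    // prints 1 -> 4, 2 -> 4, 3 -> 4, 4 -> 4
    result.vertices.collect().foreach { case (v, t) => println(s"$v -> $t") }
    sc.stop()
  }
}
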
-------------------------------------------------------------------------------
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action


Re: How could I do this algorithm in Spark?

Posted by Guillermo Ortiz <ko...@gmail.com>.
Oh, the letters were just an example, it could be:
a , t
b , o
t , k
k , c

So... a -> t -> k -> c, and the result is: a,c; t,c; k,c and b,o.
I wonder if you were thinking about sortBy because in the other example
the letters were consecutive.



Re: How could I do this algorithm in Spark?

Posted by Guillermo Ortiz <ko...@gmail.com>.
I don't see that sorting the data helps.
The answer has to be all the associations. In this case the answer has to
be:
a , b --> it was an error in the question, sorry.
b , d
c , d
x , y
y , y

I feel like all the data that is associated should be in the same
executor. In this case, if I order the inputs:
a , b
x , y
b , c
y , y
c , d
--> to
a , b
b , c
c , d
x , y
y , y

Now "a,b" and "b,c" go to one partition, for example, "c,d" and "x,y" to
another one, and so on.
I could get the relation between "a,b,c", but not between "d" and
"a,b,c"; am I wrong? I hope to be wrong!

It seems that it could be done with GraphX, but as you said, that seems
like a bit of overhead.



Re: How could I do this algorithm in Spark?

Posted by James Barney <ja...@gmail.com>.
Guillermo,
I think you're after an associative algorithm where A is ultimately
associated with D, correct? Jakob would be correct if that were a typo--a
sort would be all that is necessary in that case.

I believe you're looking for something else though, if I understand
correctly.

This seems like a similar algorithm to PageRank, no?
https://github.com/amplab/graphx/blob/master/python/examples/pagerank.py
Except it returns the "neighbor" itself, not necessarily the rank of the
page.

If you wanted to, you could use Scala and GraphX for this problem. It
might be a bit of overhead though: construct a node for each member of
each tuple with an edge between them, then traverse the graph for all
sets of nodes that are connected, as sketched below. That result set
would quickly explode in size, but you could restrict results to a
minimum of N connections. I'm not super familiar with GraphX myself,
however. My intuition is saying 'graph problem' though.
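
For the "sets of nodes that are connected" part, GraphX's built-in
connectedComponents would avoid the manual traversal. A minimal sketch,
assuming string labels are hashed into the numeric ids GraphX needs
(fine for illustration, though hash collisions are possible at scale):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

object ComponentsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cc").setMaster("local[*]"))

    // the tuples from the original question
    val pairs = sc.parallelize(Seq(("a", "b"), ("x", "y"), ("b", "c"), ("y", "y"), ("c", "d")))
    def id(s: String): VertexId = s.hashCode.toLong

    val edges = pairs.map { case (u, v) => Edge(id(u), id(v), ()) }
    val graph = Graph.fromEdges(edges, defaultValue = ())

    // labels each vertex with the smallest vertex id in its component,
    // grouping {a,b,c,d} together and {x,y} together
    val cc = graph.connectedComponents()
    cc.vertices.collect().foreach(println)
    sc.stop()
  }
}

This gives the grouping only; picking the terminal within each group
would still need a propagation step like the Pregel one discussed above.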

Thoughts?



Re: How could I do this algorithm in Spark?

Posted by Jakob Odersky <ja...@odersky.com>.
Hi Guillermo,
assuming that the first "a,b" is a typo and you actually meant "a,d",
this is a sorting problem.

You could easily model your data as an RDD of tuples (or as a
DataFrame/Dataset) and use the sortBy (or orderBy for
DataFrames/Datasets) method; a one-line sketch follows below.
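
For reference, a minimal sketch of that suggestion, assuming a
SparkContext named sc (though, as discussed elsewhere in the thread,
sorting alone does not chase the links):

// sortBy on an RDD of (String, String) pairs
val pairs = sc.parallelize(Seq(("a", "b"), ("x", "y"), ("b", "c"), ("y", "y"), ("c", "d")))
val sorted = pairs.sortBy(_._1)
// DataFrame alternative: pairs.toDF("src", "dst").orderBy("src"),
// with spark.implicits._ imported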

best,
--Jakob

