Posted to dev@spark.apache.org by love2dishtech <lo...@gmail.com> on 2014/04/08 22:02:27 UTC

Apache Spark and Graphx for Real Time Analytics

Hi,

Can GraphX, on top of Apache Spark, process large-scale distributed graph
traversals and computations in real time? Which query execution engine
distributes the query on top of GraphX and Apache Spark? My typical use
case is a large-scale distributed graph traversal in real time, with
billions of nodes.

Thanks,
Love.



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Apache-Spark-and-Graphx-for-Real-Time-Analytics-tp6261.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: Apache Spark and Graphx for Real Time Analytics

Posted by Koert Kuipers <ko...@tresata.com>.

It all depends on what kind of traversal. If it's point traversal, then
something based on random access would be great.

If it's more scan-like traversal, then Spark will fit.


On Tue, Apr 8, 2014 at 4:56 PM, Evan Chan <ev...@ooyala.com> wrote:

> I doubt Titan would be able to give you traversal of billions of nodes in
> real-time either. In-memory traversal is typically much faster than
> Cassandra-based tree traversal, even including in-memory caching.

Re: Apache Spark and Graphx for Real Time Analytics

Posted by Reynold Xin <rx...@databricks.com>.

Nick and Koert summarized it pretty well. Just to clarify and give some
concrete examples:

If you want to start with a specific vertex and follow some path, it is
probably easier and faster to use a key-value store, or even MySQL, or a
graph database.

If you want to compute the average length of paths between all nodes, or
the pairwise shortest paths for all vertices, GraphX will likely be way
faster.
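
To make the second case concrete, here is a minimal sketch in plain Python
(not GraphX) of a whole-graph computation: the average shortest-path length
over all ordered vertex pairs, computed with a breadth-first search from
every vertex. The toy graph GRAPH and the function names are hypothetical,
chosen only for illustration. Every vertex and edge is scanned once per
source vertex, which is the scan-heavy access pattern where a bulk engine
is likely to do well.

```python
from collections import deque

# Hypothetical toy graph (undirected, unweighted) as an adjacency list.
GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}


def bfs_distances(graph, source):
    """Return hop counts from `source` to every vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def average_path_length(graph):
    """Average shortest-path length over all ordered, reachable vertex pairs."""
    total = 0
    pairs = 0
    for source in graph:  # one full traversal per vertex
        for target, d in bfs_distances(graph, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


if __name__ == "__main__":
    print(average_path_length(GRAPH))
```

The cost grows with (number of vertices) x (vertices + edges), so on a graph
with billions of nodes this only makes sense as a distributed batch job, not
as a low-latency query.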






On Tue, Apr 8, 2014 at 2:03 PM, Nick Pentreath <ni...@gmail.com> wrote:

> Likely neither will give real-time for full-graph traversal, no. And once
> the graph is in memory, GraphX would definitely be faster for
> "breadth-first" traversal.
>
> But "vertex-centric" traversals (starting from a vertex and traversing
> edges from there, such as "friends of friends" queries) are the use case
> Titan is optimized for.

Re: Apache Spark and Graphx for Real Time Analytics

Posted by Nick Pentreath <ni...@gmail.com>.

Likely neither will give real-time for full-graph traversal, no. And once
the graph is in memory, GraphX would definitely be faster for
"breadth-first" traversal.

But "vertex-centric" traversals (starting from a vertex and traversing
edges from there, such as "friends of friends" queries) are the use case
Titan is optimized for.
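
As a rough illustration of what a "vertex-centric" query looks like, here
is a minimal sketch in plain Python (not Titan or GraphX) of a "friends of
friends" lookup. The FRIENDS data and the function name are hypothetical.
The query only reads the adjacency sets of one vertex and its direct
neighbors, so it needs fast random access to a handful of records rather
than a scan of the whole graph.

```python
# Hypothetical toy social graph: person -> set of direct friends.
FRIENDS = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave", "erin"},
    "dave": {"bob", "carol"},
    "erin": {"carol"},
}


def friends_of_friends(friends, person):
    """People exactly two hops from `person`, excluding direct friends and self."""
    direct = friends.get(person, set())
    two_hops = set()
    for friend in direct:  # one keyed lookup per direct friend
        two_hops |= friends.get(friend, set())
    return two_hops - direct - {person}


if __name__ == "__main__":
    print(sorted(friends_of_friends(FRIENDS, "alice")))
```

The work is proportional to the size of the starting vertex's two-hop
neighborhood, independent of the total number of vertices, which is why a
store optimized for keyed lookups suits this kind of query.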




On Tue, Apr 8, 2014 at 10:56 PM, Evan Chan <ev...@ooyala.com> wrote:

> I doubt Titan would be able to give you traversal of billions of nodes in
> real-time either. In-memory traversal is typically much faster than
> Cassandra-based tree traversal, even including in-memory caching.

Re: Apache Spark and Graphx for Real Time Analytics

Posted by Evan Chan <ev...@ooyala.com>.

I doubt Titan would be able to give you traversal of billions of nodes in
real-time either. In-memory traversal is typically much faster than
Cassandra-based tree traversal, even including in-memory caching.


On Tue, Apr 8, 2014 at 1:23 PM, Nick Pentreath <ni...@gmail.com> wrote:

> GraphX, like Spark, will not typically be "real-time" (where by "real-time"
> here I assume you mean on the order of tens to hundreds of milliseconds, up
> to a few seconds).
>
> Spark can in some cases approach the upper boundary of this definition (a
> second or two, possibly less) when data is cached in memory and the
> computation is not "too heavy", while Spark Streaming may be able to get
> closer to the mid-to-upper boundary of this under similar conditions,
> especially if aggregating over relatively small windows.
>
> However, for this use case (while I haven't used GraphX yet) I would say
> something like Titan (https://github.com/thinkaurelius/titan/wiki) or a
> similar OLTP graph DB may be what you're after. But this depends on what
> kind of graph traversal you need.



--
Evan Chan
Staff Engineer
ev@ooyala.com

Re: Apache Spark and Graphx for Real Time Analytics

Posted by Nick Pentreath <ni...@gmail.com>.

GraphX, like Spark, will not typically be "real-time" (where by "real-time"
here I assume you mean on the order of tens to hundreds of milliseconds, up
to a few seconds).

Spark can in some cases approach the upper boundary of this definition (a
second or two, possibly less) when data is cached in memory and the
computation is not "too heavy", while Spark Streaming may be able to get
closer to the mid-to-upper boundary of this under similar conditions,
especially if aggregating over relatively small windows.

However, for this use case (while I haven't used GraphX yet) I would say
something like Titan (https://github.com/thinkaurelius/titan/wiki) or a
similar OLTP graph DB may be what you're after. But this depends on what
kind of graph traversal you need.

On Tue, Apr 8, 2014 at 10:02 PM, love2dishtech <lo...@gmail.com> wrote:

> Hi,
>
> Can GraphX, on top of Apache Spark, process large-scale distributed graph
> traversals and computations in real time? Which query execution engine
> distributes the query on top of GraphX and Apache Spark? My typical use
> case is a large-scale distributed graph traversal in real time, with
> billions of nodes.
>
> Thanks,
> Love.