Posted to dev@giraph.apache.org by Mirko Kämpf <mi...@cloudera.com> on 2014/06/11 14:26:00 UTC

Re: Introducing Graft: A debugging and testing tool for Giraph algorithms

Hi,

some time ago I started working on visualizing graph data stored in
Hadoop via Gephi. A first draft of results is in this blog post:
http://blog.cloudera.com/blog/2014/05/how-to-manage-time-dependent-multilayer-networks-in-apache-hadoop/
We found that handling the metadata for graphs and building the appropriate
input converters was the major problem to solve. It is now easy to retrieve
edge and node lists, even for time-dependent graphs. The current solution
uses Hive or Impala to retrieve the data via JDBC.
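As a rough sketch of the retrieval side (the "edges" table, its columns, and the connection URL below are hypothetical placeholders, not taken from the blog post), fetching a time-dependent edge list over JDBC could look like this:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class EdgeListFetcher {

    // Query for a time-dependent edge list; the table and column names
    // ("edges", "src", "dst", "ts") are illustrative placeholders.
    static String edgeListQuery(long fromTs, long toTs) {
        return "SELECT src, dst FROM edges WHERE ts BETWEEN "
                + fromTs + " AND " + toTs;
    }

    // Runs the query over an existing JDBC connection, e.g. one opened via
    // DriverManager.getConnection("jdbc:hive2://host:10000/default").
    static void printEdges(Connection conn, long fromTs, long toTs) throws Exception {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(edgeListQuery(fromTs, toTs))) {
            while (rs.next()) {
                System.out.println(rs.getString("src") + "\t" + rs.getString("dst"));
            }
        }
    }

    public static void main(String[] args) {
        // Without a live HiveServer2 we only show the generated query.
        System.out.println(edgeListQuery(0L, 1_000_000L));
    }
}
```

An Impala endpoint could be queried the same way; only the JDBC URL would differ.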

But I think it would be great to have an API in Giraph that allows
triggering a snapshot of the current state of a graph while it is being
processed. Once such a snapshot is written, an external tool can load the
data, e.g. into Gephi. Maybe in a second step we can load the data directly
from all worker nodes instead of from HDFS, but for a start it would be
fine to use HDFS to decouple the processing layer from the GUI.

In the case of really large graphs, I think a Java applet using the
"gephi-tools" project could do a great job of rendering the graph.

The snapshot could be triggered via ZooKeeper. A job registers its ability
to receive such an optional request, and via ZooKeeper a client can find
all graphs it can look into (based on such a snapshot) and then sends the
request. In the next superstep the job checks the snapshot status in
ZooKeeper and either creates a snapshot or just proceeds, and so on. This
would even allow exporting time-dependent intermediate results from running
graph algorithms without a restart.
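To make the per-superstep check concrete, here is a minimal sketch in which a plain in-memory set stands in for the ZooKeeper znodes; in a real implementation the job would watch a path such as /giraph/snapshots/<jobId> (that path, the job id, and the Action type are my own illustrative names, not existing Giraph API):

```java
import java.util.HashSet;
import java.util.Set;

public class SnapshotCoordinator {

    enum Action { WRITE_SNAPSHOT, PROCEED }

    // Called once at the start of each superstep: if a client has filed a
    // snapshot request for this job (in reality: created a request znode),
    // write a snapshot and clear the request; otherwise just proceed.
    static Action checkAtSuperstepStart(Set<String> requests, String jobId) {
        if (requests.remove(jobId)) {
            return Action.WRITE_SNAPSHOT;
        }
        return Action.PROCEED;
    }

    public static void main(String[] args) {
        Set<String> requests = new HashSet<>();
        System.out.println(checkAtSuperstepStart(requests, "job_42")); // PROCEED
        requests.add("job_42"); // a client files a request "via ZooKeeper"
        System.out.println(checkAtSuperstepStart(requests, "job_42")); // WRITE_SNAPSHOT
        System.out.println(checkAtSuperstepStart(requests, "job_42")); // PROCEED again
    }
}
```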

What do you think about such a feature? I think it is also related to the
"graph-centric API" proposed a while ago.
Is it worth a JIRA, and do you see use cases for this feature?

Best wishes,
Mirko

Re: Introducing Graft: A debugging and testing tool for Giraph algorithms

Posted by Mirko Kämpf <mi...@cloudera.com>.
Hi Claudio,
thanks for your comment. I agree that in the case of really large graphs it
might be too much for Gephi, or any other tool, to visualize. So I am
thinking about a bigger picture in which a "disk-based state" of a graph
can be used to filter, search, or select subgraphs: e.g., starting with a
search result we get a set of nodes, and then we select the next, second,
or n-th neighborhood. Node lists can be filtered or partitioned based on
node properties, e.g. natural properties like gender or region, or whatever
semantic annotation one works with, or based on graph properties (rank,
centrality, and so on). The Gephi connector allows such "on demand"
filtering. With Giraph snapshots, this behavior is already available, and
in the long term such an interactive on-demand snapshot enables interesting
use cases, especially for studying the behavior of large-scale graph growth
models in which phase transitions and, e.g., changing growth rates are the
focus.
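The neighborhood-selection step above can be sketched as a depth-limited breadth-first expansion over an adjacency map (all names here are illustrative; a real implementation would run against the disk-based state rather than in memory):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class Neighborhood {

    // Returns all nodes within n hops of the seed set (e.g. a search
    // result) in a graph given as an adjacency map.
    static Set<String> nHopNeighborhood(Map<String, List<String>> adj,
                                        Set<String> seeds, int n) {
        Set<String> visited = new HashSet<>(seeds);
        Set<String> frontier = new HashSet<>(seeds);
        for (int hop = 0; hop < n; hop++) {
            Set<String> next = new HashSet<>();
            for (String node : frontier) {
                for (String nb : adj.getOrDefault(node, List.of())) {
                    if (visited.add(nb)) {  // true only on first visit
                        next.add(nb);
                    }
                }
            }
            frontier = next;
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> adj = Map.of(
                "a", List.of("b"),
                "b", List.of("a", "c"),
                "c", List.of("b", "d"),
                "d", List.of("c"));
        // 2-hop neighborhood around {a}
        System.out.println(nHopNeighborhood(adj, Set.of("a"), 2));
    }
}
```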
I will describe the idea in more detail next week and file a JIRA.

Cheers,
Mirko




Re: Introducing Graft: A debugging and testing tool for Giraph algorithms

Posted by Claudio Martella <cl...@gmail.com>.
In general, I think this is a cool idea that could be prototyped quickly,
e.g. by leveraging existing features like Giraph snapshots. These allow us
to write the state of the computation (e.g. the vertex values, but also
messages) to HDFS during the computation, which we can then read together
with the input graph (if we assume a static graph). My concern is that if
you use Giraph, you probably have a graph that is really large, even for
Gephi. So I feel one of the questions around such a tool would be how to
sample the data effectively.
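One simple answer to the sampling question is reservoir sampling, which keeps a uniform random subset of the edges in a single pass over a stream of unknown length, so a snapshot far too large for Gephi could still be thinned down for display. A sketch (not anything Giraph ships):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class EdgeSampler {

    // Reservoir sampling (Algorithm R): after the stream ends, the
    // reservoir holds a uniform random sample of at most k edges.
    static List<String> sample(Iterable<String> edges, int k, Random rnd) {
        List<String> reservoir = new ArrayList<>(k);
        long seen = 0;
        for (String edge : edges) {
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(edge);          // fill the reservoir first
            } else {
                long j = (long) (rnd.nextDouble() * seen);  // in [0, seen)
                if (j < k) {
                    reservoir.set((int) j, edge);  // replace with prob. k/seen
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<String> edges = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            edges.add("e" + i);
        }
        System.out.println(sample(edges, 5, new Random(42)));
    }
}
```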

Anyway, I welcome Graft. Tools should really be at the top of our
priorities with Giraph at this point. On this topic, do you guys at
Facebook (read @Avery) plan to release your visualizer anytime soon? I've
seen the slides for your presentation at IWGDM, and there's a slide about a
visualizer called GiraphicJam there (slide 27).





-- 
   Claudio Martella