Posted to user@giraph.apache.org by MengXiaodong <me...@gmail.com> on 2015/03/11 06:37:05 UTC

How to format Giraph input dataset

Hi all,

I'm new to Giraph and have successfully run my first example by following the instructions in Giraph - Quick Start. However, I ran into a question when writing my own Giraph code.

In the "Quick Start", the input graph has the following format:

[0,0,[[1,1],[3,3]]]
[1,0,[[0,1],[2,2],[3,1]]]
[2,0,[[1,2],[4,4]]]
[3,0,[[0,3],[1,1],[4,4]]]
[4,0,[[3,4],[2,4]]]
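
For readers unfamiliar with this layout, each line above is a JSON array of the form [vertex_id, vertex_value, [[target_id, edge_value], ...]]. The following is only a minimal Python sketch of how one such line can be read; the function name parse_vertex_line is made up for illustration and is not part of Giraph.

```python
import json


def parse_vertex_line(line):
    """Parse one quick-start style line into (id, value, edges).

    Hypothetical helper for illustration; it is not a Giraph API.
    """
    vertex_id, vertex_value, raw_edges = json.loads(line)
    # Each edge is a two-element list: [target_id, edge_value].
    edges = [(target, weight) for target, weight in raw_edges]
    return vertex_id, vertex_value, edges


if __name__ == "__main__":
    print(parse_vertex_line("[1,0,[[0,1],[2,2],[3,1]]]"))
    # prints (1, 0, [(0, 1), (2, 2), (3, 1)])
```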

But graph datasets downloaded from public websites (such as the Facebook and Twitter social networks) come in various formats. How can I transform such a graph into the standard Giraph format shown above?

For example, the WikiTalk graph shown below is a directed graph. A directed edge A->B means that user A edited the talk page of user B.

# FromNodeId	ToNodeId
0	1
2	1
2	21
2	46
2	63
2	88
2	93
2	94
2	101
2	102
2	103
2	116
2	119
2	125

Regards,
Ralph

Re: How to format Giraph input dataset

Posted by Steven Harenberg <sd...@ncsu.edu>.
Hi Ralph,

I also wanted to use an edge-list input format, since I am running
examples from SNAP. I ran into a lot of issues, and at this point, if I could
go back in time, I would probably just write a script to convert the graphs
into Giraph's standard format.
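
Such a conversion script could look roughly like the sketch below. It is not something from this thread: the function name, the default vertex value 0, and the default edge value 1 are all assumptions chosen to mirror the quick-start example, and it holds the whole graph in memory. It also emits a line with an empty edge list for every vertex that only appears as a target, so that each vertex exists in the input.

```python
import json


def edge_list_to_giraph_json(lines, vertex_value=0, edge_value=1):
    """Convert 'source<TAB>target' lines into one JSON array per vertex.

    Hypothetical helper; vertex_value and edge_value are assumed defaults.
    """
    adjacency = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and SNAP-style comment headers
        source, target = (int(field) for field in line.split()[:2])
        adjacency.setdefault(source, []).append([target, edge_value])
        # Register the target too, so vertices with only incoming
        # edges still get their own line (with an empty edge list).
        adjacency.setdefault(target, [])
    return [
        json.dumps([vertex, vertex_value, adjacency[vertex]],
                   separators=(",", ":"))
        for vertex in sorted(adjacency)
    ]


if __name__ == "__main__":
    sample = ["# FromNodeId\tToNodeId", "0\t1", "2\t1", "2\t21"]
    for out_line in edge_list_to_giraph_json(sample):
        print(out_line)
```

The resulting lines can be written to a file and passed to the job with -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat, as in the quick start.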

To deal with the type of errors you had above, I created my own class files:

   - LongFloatTextEdgeInputFormat.java (for pagerank)
   - LongNullTextEdgeInputFormat.java
   - LongNullReverseTextEdgeInputFormat.java (for undirected)
   - LongPair (used inside the above classes)

Basically, these were the same as their corresponding Int class files, just with long vertex IDs.

However, there is a fundamental issue with SSSP (and, I believe, PageRank)
when using an edge-list input format. If a vertex is never listed first in
an edge (e.g., it only has incoming edges), it will not be "active" in
superstep 0. This means it will not be initialized with the correct value (
http://mail-archives.apache.org/mod_mbox/giraph-user/201502.mbox/%3CCAHv2Baw7zFJ-s7dtNMv5dkNxz_zE436krE%2B6G4r3tp-HVgjW2g%40mail.gmail.com%3E
).
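
To see which vertices of a dataset would be affected by this, one can list the IDs that never occur in the first column. The snippet below is a small illustrative sketch, not part of Giraph; the function name is invented, and the sample edges are the first few lines of the WikiTalk excerpt from the original question.

```python
def vertices_without_out_edges(edges):
    """Return sorted vertex IDs that appear only as edge targets.

    Hypothetical helper; 'edges' is an iterable of (source, target) pairs.
    """
    edges = list(edges)
    sources = {source for source, _ in edges}
    targets = {target for _, target in edges}
    return sorted(targets - sources)


if __name__ == "__main__":
    sample_edges = [(0, 1), (2, 1), (2, 21), (2, 46)]
    print(vertices_without_out_edges(sample_edges))  # prints [1, 21, 46]
```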


Re: How to format Giraph input dataset

Posted by MengXiaodong <me...@gmail.com>.
Hi Martin,

Thank you for your kind reply. I followed your suggestion and ran the command below:

hadoop jar giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimpleShortestPathsComputation -eif org.apache.giraph.io.formats.IntNullTextEdgeInputFormat -eip /WikiTalk.txt -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /outputTran -w 1

However, I got an error when I ran this command:
Exception in thread "main" java.lang.IllegalArgumentException: checkClassTypes: vertex index types not assignable, computation - class org.apache.hadoop.io.LongWritable, EdgeInputFormat - class org.apache.hadoop.io.NullWritable
	at org.apache.giraph.job.GiraphConfigurationValidator.checkAssignable(GiraphConfigurationValidator.java:384)
	at org.apache.giraph.job.GiraphConfigurationValidator.verifyEdgeInputFormatGenericTypes(GiraphConfigurationValidator.java:242)
	at org.apache.giraph.job.GiraphConfigurationValidator.validateConfiguration(GiraphConfigurationValidator.java:142)
	at org.apache.giraph.utils.ConfigurationUtils.parseArgs(ConfigurationUtils.java:222)
	at org.apache.giraph.GiraphRunner.run(GiraphRunner.java:74)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
	at org.apache.giraph.GiraphRunner.main(GiraphRunner.java:124)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)



I assume the error happens because the input format uses IntWritable while the example uses LongWritable as the vertex ID. If so, may I ask how to convert from IntWritable to LongWritable?

Kind regards,
Ralph



Re: How to format Giraph input dataset

Posted by Martin Junghanns <ma...@gmx.net>.
Hi Ralph,

You can set a vertex or edge input format when running a Giraph job.
In the example, you used the vertex input format (vif)

"-vif
org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat"

Your WikiTalk input is an edge list, and Giraph offers, e.g.,

"org.apache.giraph.io.formats.IntNullTextEdgeInputFormat"

which reads a graph where "Each line consists of: source_vertex,
target_vertex" (separated by a \t)
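
Before submitting a job, it can help to check that every line of the file really has this shape. The Python sketch below is an assumption-laden illustration, not Giraph code: it accepts only two tab-separated integers that fit a signed 32-bit value (the range of an IntWritable) and rejects everything else. Header lines such as "# FromNodeId	ToNodeId" do not have this shape, so the sketch reports them; whether the input format itself tolerates such lines is not confirmed here.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1  # range of a signed 32-bit integer


def check_edge_line(line):
    """Return (source, target) if the line is 'int<TAB>int', else None.

    Hypothetical helper for pre-checking an edge-list file.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        return None
    try:
        source, target = int(fields[0]), int(fields[1])
    except ValueError:
        return None  # e.g. a '#' comment header
    if not (INT_MIN <= source <= INT_MAX and INT_MIN <= target <= INT_MAX):
        return None  # would not fit a 32-bit vertex ID
    return source, target


if __name__ == "__main__":
    for text in ["2\t125", "# FromNodeId\tToNodeId", "3000000000\t1"]:
        print(repr(text), "->", check_edge_line(text))
```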

You can set the edge input format via the -eif parameter.

Cheers,
Martin

The package "org.apache.giraph.io.formats" in giraph-core contains a lot
more formats.
