Posted to user@giraph.apache.org by Kenrick Fernandes <ke...@gmail.com> on 2015/04/25 23:58:54 UTC

Input format problems running Giraph 1.1.0 on Twitter dataset

Hello,

I'm trying to get Giraph to read the Twitter dataset as input for the
SimplePageRankComputation program. The dataset format looks like this:
61578010 61147436
61578037 61147436
61578040 61147436
(vertex IDs, with each pair representing an edge)

When I run the command with
*-vif org.apache.giraph.io.formats.IntIntNullTextInputFormat*, I get this
error:
*java.lang.IllegalArgumentException: checkClassTypes: vertex index types
not assignable, computation - class org.apache.hadoop.io.LongWritable,
VertexInputFormat - class org.apache.hadoop.io.IntWritable*

So I tried running the command with
*-vif org.apache.giraph.io.formats.LongLongNullTextInputFormat* and I get a
different one:
*java.lang.IllegalArgumentException: checkClassTypes: vertex value types
not assignable, computation - class org.apache.hadoop.io.DoubleWritable,
VertexInputFormat - class org.apache.hadoop.io.LongWritable*

I don't understand why the types in the input show up as different formats
in each error. Also, as far as I could tell, there is no input format for
DoubleDouble. Is there a different way to get the graph into Giraph without
having to write custom input code? Thoughts would be much appreciated.

-----
Reference Command:
*hadoop jar
giraph-examples-1.1.0-for-hadoop-1.1.2-jar-with-dependencies.jar
org.apache.giraph.GiraphRunner
org.apache.giraph.examples.PageRankComputation -vif
org.apache.giraph.io.formats.LongLongNullTextInputFormat -vip
/user/kenrick/twitter/input -op /user/kenrick/twitter/output -w 30*
-----

Thanks,
Kenrick

Re: Input format problems running Giraph 1.1.0 on Twitter dataset

Posted by Young Han <yo...@uwaterloo.ca>.
Hmm.. you might have an off-by-one error in your MasterCompute. The
superstep counter is -1 during input loading and starts at 0 for the first
iteration of computation. Assuming things haven't changed since 1.1.0-RC0,
MasterCompute executes after the end of a superstep (after the global
barrier) but before the start of a new superstep. However, the tricky bit
is that it also runs after the input superstep (superstep -1). So what you
might be seeing is the number of vertices after SS -1 (incorrect), after SS 0
(still incorrect), and after SS 1 (now correct).

What Steven said regarding vertex addition is correct. Internally, when
there is a message for a vertex that doesn't exist, Giraph will (by
default) add that vertex via a vertex mutation. These mutations are all
performed during a global barrier (i.e., between SS 0 and SS 1). So for
SimplePageRankComputation, you have all vertices broadcasting to their
out-edge neighbours during SS 0. This means all missing vertices receive
messages and so they get added after SS 0 but before SS 1. In SS 1, you
will observe that all vertices without out-edge neighbours are now added.
The VertexValueFactory solution works because it is called by Giraph when
creating/adding these missing vertices.
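
The vertex-creation behaviour described above can be illustrated with a toy
model in plain Java. This is only a sketch: it uses no Giraph classes, the
names MissingVertexSketch and run are made up for this example, and it
assumes (as described above) that only vertices with out-edges exist after
input loading and that message targets are created at the barrier after
superstep 0. It does not model when MasterCompute runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model (not Giraph code) of vertices being created on message receipt. */
final class MissingVertexSketch {

    /** Returns the vertex set present in superstep 0, then in superstep 1. */
    static List<Set<Long>> run(long[][] edges) {
        // Assumed input loading: only edge sources become vertices.
        Set<Long> vertices = new TreeSet<>();
        for (long[] edge : edges) {
            vertices.add(edge[0]);
        }
        Set<Long> atSuperstep0 = new TreeSet<>(vertices);

        // Superstep 0: every existing vertex messages its out-neighbours.
        Set<Long> messageTargets = new TreeSet<>();
        for (long[] edge : edges) {
            messageTargets.add(edge[1]);
        }

        // Barrier between superstep 0 and 1: missing targets are created.
        vertices.addAll(messageTargets);
        Set<Long> atSuperstep1 = new TreeSet<>(vertices);

        List<Set<Long>> result = new ArrayList<>();
        result.add(atSuperstep0);
        result.add(atSuperstep1);
        return result;
    }

    public static void main(String[] args) {
        // The three sample edges from the original post.
        long[][] edges = {
            {61578010L, 61147436L}, {61578037L, 61147436L}, {61578040L, 61147436L}
        };
        List<Set<Long>> seen = run(edges);
        System.out.println("superstep 0: " + seen.get(0).size() + " vertices");
        System.out.println("superstep 1: " + seen.get(1).size() + " vertices");
    }
}
```

With the three sample edges, vertex 61147436 has only incoming edges, so it
is absent in superstep 0 and present in superstep 1.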

Peering into the internals, I believe the order of execution is: end of
superstep reached -> workers flush all messages -> workers perform graph
mutations -> all workers arrive at the global barrier -> master compute
executes -> workers begin new superstep. (And input loading is a special
case: input loading/partition exchange -> all workers arrive at the global
barrier -> master compute executes -> workers begin superstep 0.)

Young

On Mon, May 4, 2015 at 1:16 PM, Steven Harenberg <sd...@ncsu.edu> wrote:

> My understanding is that a vertex with only incoming edges will not be
> active until it receives a message, which is why you don't see all of the
> vertices initially. The easiest way to test this is to write a script that
> parses your input and creates a new data file where every vertex is
> specified on a line of its own. Even if it has no outgoing neighbors, just
> leave the neighbor empty. Or, first just check if you have
> 40383589-40103281=280308 vertices with only incoming edges.
>
> Young provided another solution for fixing the initialization problem, but
> it looks like that fix wasn't specified in the code you ran, which would
> explain why it still has the problem.
>
> Either transform the input (seems like the easiest thing to do), or try
> the fix Young said. I would bet either of those would fix the issue. Young
> may have better ideas since he seems more experienced with Giraph than I am.
>
> --Steve

Re: Input format problems running Giraph 1.1.0 on Twitter dataset

Posted by Steven Harenberg <sd...@ncsu.edu>.
My understanding is that a vertex with only incoming edges will not be
active until it receives a message, which is why you don't see all of the
vertices initially. The easiest way to test this is to write a script that
parses your input and creates a new data file where every vertex is
specified on a line of its own. Even if it has no outgoing neighbors, just
leave the neighbor empty. Or, first just check if you have
40383589-40103281=280308 vertices with only incoming edges.
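
The check suggested above (how many vertices have only incoming edges) could
be sketched as follows. This is a minimal illustration in plain Java, not
part of Giraph; the class and method names are made up, and it assumes one
whitespace-separated "source target" pair per line, as in the sample from
the original post. Holding every ID in a HashSet is fine for a sample but
would need a lot of memory for the full Twitter graph.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Counts vertices that appear as an edge target but never as an edge source. */
final class IncomingOnlyCount {

    static long countIncomingOnly(Iterable<String> edgeLines) {
        Set<Long> sources = new HashSet<>();
        Set<Long> targets = new HashSet<>();
        for (String line : edgeLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = trimmed.split("\\s+");
            sources.add(Long.parseLong(parts[0]));
            targets.add(Long.parseLong(parts[1]));
        }
        // What remains are vertices with incoming edges only.
        targets.removeAll(sources);
        return targets.size();
    }

    public static void main(String[] args) {
        long count = countIncomingOnly(Arrays.asList(
            "61578010 61147436", "61578037 61147436", "61578040 61147436"));
        System.out.println(count + " vertices have only incoming edges");
    }
}
```

If the result on the real input matches the 280308 gap between the two
logged vertex counts, that would support the explanation above.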

Young provided another solution for fixing the initialization problem, but
it looks like that fix wasn't specified in the code you ran, which would
explain why it still has the problem.

Either transform the input (seems like the easiest thing to do), or try the
fix Young said. I would bet either of those would fix the issue. Young may
have better ideas since he seems more experienced with Giraph than I am.

--Steve

On Sat, May 2, 2015 at 2:19 PM, Kenrick Fernandes <ke...@gmail.com>
wrote:

> Thank you both for your responses.
>
> Steve, I faced the same problem when I created the Long input format
> files.
> I tried running the code linked by Young above, using the
> *SimplePageRankInputFormat.java*
> as well as the *SimplePageRankVertex.java* in the repo.
>
> For the Twitter dataset, I added some *MasterCompute* code to log the
> number of vertices that existed at each superstep. The results, however,
> look pretty similar to the previous iteration:
>
> Current step is 1 - 40103281 existed in the previous superstep 0
> Current step is 2 - 40103281 existed in the previous superstep 1
>
> Current step is 3 - 40383589 existed in the previous superstep 2
>
> Current step is 31 - 40383589 existed in the previous superstep 30
>
> It seems that a subset of vertices still only becomes active after the
> first superstep, despite all vertices being initialized in superstep 0.
> I can't think of a reason why - thoughts?
>
> Thanks,
> Kenrick

Re: Input format problems running Giraph 1.1.0 on Twitter dataset

Posted by Kenrick Fernandes <ke...@gmail.com>.
Thank you both for your responses.

Steve, I faced the same problem when I created the Long input format files.
I tried running the code linked by Young above, using the
*SimplePageRankInputFormat.java*
as well as the *SimplePageRankVertex.java* in the repo.

For the Twitter dataset, I added some *MasterCompute* code to log the
number of vertices that existed at each superstep. The results, however,
look pretty similar to the previous iteration:

Current step is 1 - 40103281 existed in the previous superstep 0
Current step is 2 - 40103281 existed in the previous superstep 1

Current step is 3 - 40383589 existed in the previous superstep 2

Current step is 31 - 40383589 existed in the previous superstep 30

It seems that a subset of vertices still only becomes active after the first
superstep, despite all vertices being initialized in superstep 0. I can't
think of a reason why - thoughts?

Thanks,
Kenrick



On Wed, Apr 29, 2015 at 2:33 PM, Young Han <yo...@uwaterloo.ca> wrote:

> For the initialization issue, you can define a (nested) class that extends
> DefaultVertexValueFactory (from org.apache.giraph.factories) and add
> "-Dgiraph.vertexValueFactoryClass=org.apache.giraph.examples.AlgClass\$AlgVertexValueFactory"
> after "org.apache.giraph.GiraphRunner" in your hadoop jar command.
>
> Also, the reason those input formats don't work is because PageRank is
> using LongWritable for vertex id and DoubleWritable for vertex value. As
> Roman pointed out, you have to have an input class that matches it (even if
> the input dataset has no "double" vertex values). An example (for Giraph
> 1.0.0) can be found here:
> https://github.com/xvz/graph-processing/blob/master/giraph-1.0.0/giraph-examples/src/main/java/org/apache/giraph/examples/SimplePageRankInputFormat.java
> and an example command that uses it here:
> https://github.com/xvz/graph-processing/blob/master/benchmark/giraph/pagerank.sh#L50
>
> Young

Re: Input format problems running Giraph 1.1.0 on Twitter dataset

Posted by Young Han <yo...@uwaterloo.ca>.
For the initialization issue, you can define a (nested) class that extends
DefaultVertexValueFactory (from org.apache.giraph.factories) and add
"-Dgiraph.vertexValueFactoryClass=org.apache.giraph.examples.AlgClass\$AlgVertexValueFactory"
after "org.apache.giraph.GiraphRunner" in your hadoop jar command.

Also, the reason those input formats don't work is because PageRank is
using LongWritable for vertex id and DoubleWritable for vertex value. As
Roman pointed out, you have to have an input class that matches it (even if
the input dataset has no "double" vertex values). An example (for Giraph
1.0.0) can be found here:
https://github.com/xvz/graph-processing/blob/master/giraph-1.0.0/giraph-examples/src/main/java/org/apache/giraph/examples/SimplePageRankInputFormat.java
and an example command that uses it here:
https://github.com/xvz/graph-processing/blob/master/benchmark/giraph/pagerank.sh#L50
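
To see why the two attempts in the original post fail with different
messages, here is a toy version of the type check in plain Java. It is a
sketch under assumptions: the three nested classes are empty stand-ins for
the Hadoop Writable types (not the real ones), the check method is made up,
and it assumes the vertex ID type is compared before the vertex value type,
which is what the two reported errors suggest.

```java
/** Toy model (not Giraph code) of matching input-format types to a computation. */
final class TypeCheckSketch {

    // Empty stand-ins for the Hadoop Writable classes named in the thread.
    static class IntWritable {}
    static class LongWritable {}
    static class DoubleWritable {}

    /** Returns "ok" or a description of the first mismatch found. */
    static String check(Class<?> computationId, Class<?> computationValue,
                        Class<?> formatId, Class<?> formatValue) {
        // Assumed order: the vertex ID type is checked first.
        if (!computationId.isAssignableFrom(formatId)) {
            return "vertex index types not assignable, computation - "
                + computationId.getSimpleName()
                + ", VertexInputFormat - " + formatId.getSimpleName();
        }
        // The vertex value type is only reached if the IDs match.
        if (!computationValue.isAssignableFrom(formatValue)) {
            return "vertex value types not assignable, computation - "
                + computationValue.getSimpleName()
                + ", VertexInputFormat - " + formatValue.getSimpleName();
        }
        return "ok";
    }

    public static void main(String[] args) {
        // PageRank in this thread: LongWritable IDs, DoubleWritable values.
        Class<?> id = LongWritable.class;
        Class<?> value = DoubleWritable.class;
        // An Int/Int format fails on the ID.
        System.out.println(check(id, value, IntWritable.class, IntWritable.class));
        // A Long/Long format passes the ID check and fails on the value.
        System.out.println(check(id, value, LongWritable.class, LongWritable.class));
        // A Long/Double format matches.
        System.out.println(check(id, value, LongWritable.class, DoubleWritable.class));
    }
}
```

In this model the input data itself never matters; only the types declared
by the input format class are compared with those of the computation.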

Young

On Wed, Apr 29, 2015 at 11:24 AM, Steven Harenberg <sd...@ncsu.edu>
wrote:

> Hey Kenrick,
>
> First, your commands above are wrong: the -vif argument specifies a vertex
> input format, and I believe *LongLongNullTextInputFormat* reads
> adjacency-list input, whereas your data is an edge list. However, even with
> the right commands there will be issues and more things you need to do.
>
> I did get the edgelist input format to work by creating a
> LongNullTextEdgeInputFormat.java file just like the
> giraph-core/src/main/java/org/apache/giraph/io/formats/IntNullTextEdgeInputFormat.java
> file, but with longs instead of ints (this also required creating a
> LongPair class).
>
> However, I would advise against using an edgelist input format in Giraph
> as there are major underlying issues that I never figured out how to
> resolve. Namely, for an edgelist format, Giraph only considers a vertex
> active in the first superstep if it has an outgoing edge. This means that
> vertices with only incoming edges won't be initialized with correct values
> during things like PageRank, SSSP, or WCC and hence will output incorrect
> results. (You can see my previous thread here:
> http://mail-archives.apache.org/mod_mbox/giraph-user/201502.mbox/%3CCAHv2Baw7zFJ-s7dtNMv5dkNxz_zE436krE%2B6G4r3tp-HVgjW2g%40mail.gmail.com%3E
> )
>
> The above issue can be avoided with adjacency list format by specifying
> the vertex with no neighbors. For example, if vertex v has only incoming
> edges, then you make sure there is a line with just v and no neighbors
> listed (
> http://mail-archives.apache.org/mod_mbox/giraph-user/201408.mbox/%3C1409255770206.93691@uiowa.edu%3E
> ).
>
> If you figure out how to resolve the edgelist input issue please let me
> know.
>
> Regards,
> Steve

Re: Input format problems running Giraph 1.1.0 on Twitter dataset

Posted by Steven Harenberg <sd...@ncsu.edu>.
Hey Kenrick,

First, your commands above are wrong: the -vif argument specifies a vertex
input format, and I believe *LongLongNullTextInputFormat* reads
adjacency-list input, whereas your data is an edge list. However, even with
the right commands there will be issues and more things you need to do.

I did get the edgelist input format to work by creating a
LongNullTextEdgeInputFormat.java file just like the
giraph-core/src/main/java/org/apache/giraph/io/formats/IntNullTextEdgeInputFormat.java
file, but with longs instead of ints (this also required creating a
LongPair class).

However, I would advise against using an edgelist input format in Giraph as
there are major underlying issues that I never figured out how to resolve.
Namely, for an edgelist format, Giraph only considers a vertex active in
the first superstep if it has an outgoing edge. This means that vertices
with only incoming edges won't be initialized with correct values during
things like PageRank, SSSP, or WCC and hence will output incorrect results.
(You can see my previous thread here:
http://mail-archives.apache.org/mod_mbox/giraph-user/201502.mbox/%3CCAHv2Baw7zFJ-s7dtNMv5dkNxz_zE436krE%2B6G4r3tp-HVgjW2g%40mail.gmail.com%3E
)
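
To see which vertices are affected, here is a minimal sketch (plain Python, not Giraph code) that lists the vertices appearing only as edge targets. The function name is hypothetical, and it assumes one whitespace-separated "source target" pair per line, as in the sample from the original post.

```python
def vertices_with_only_incoming_edges(edge_lines):
    """Return the sorted ids that appear as a target but never as a source."""
    sources, targets = set(), set()
    for line in edge_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        src, dst = int(parts[0]), int(parts[1])
        sources.add(src)
        targets.add(dst)
    return sorted(targets - sources)


# The three sample edges from the original post.
sample = ["61578010 61147436", "61578037 61147436", "61578040 61147436"]
print(vertices_with_only_incoming_edges(sample))  # prints [61147436]
```

In the sample, 61147436 has only incoming edges, so it is the kind of vertex that would not be initialized correctly when loaded from an edge list.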

The above issue can be avoided with adjacency list format by specifying the
vertex with no neighbors. For example, if vertex v has only incoming edges,
then you make sure there is a line with just v and no neighbors listed (
http://mail-archives.apache.org/mod_mbox/giraph-user/201408.mbox/%3C1409255770206.93691@uiowa.edu%3E
).
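
A rough sketch of that preprocessing step, in plain Python rather than Giraph code: it turns an edge list into adjacency lines and emits a line for every vertex, including those with no outgoing edges. The function name is hypothetical, and the output layout (vertex id followed by space-separated neighbor ids) is an assumption about what the adjacency list input format expects, so check it against the format class you actually use. It holds the whole graph in memory, which may not be practical for the full Twitter dataset.

```python
def edge_list_to_adjacency_lines(edge_lines):
    """Convert 'source target' lines to 'vertex neighbor1 neighbor2 ...' lines."""
    adjacency = {}
    for line in edge_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        src, dst = int(parts[0]), int(parts[1])
        adjacency.setdefault(src, []).append(dst)
        # Make sure the target gets its own line even if it has no out-edges.
        adjacency.setdefault(dst, [])
    return [
        " ".join(str(v) for v in [vertex] + neighbors)
        for vertex, neighbors in sorted(adjacency.items())
    ]


sample = ["61578010 61147436", "61578037 61147436", "61578040 61147436"]
for adjacency_line in edge_list_to_adjacency_lines(sample):
    print(adjacency_line)
```

For the sample, vertex 61147436 ends up on a line by itself with no neighbors listed.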

If you figure out how to resolve the edgelist input issue, please let me
know.

Regards,
Steve



Re: Input format problems running Giraph 1.1.0 on Twitter dataset

Posted by Kenrick Fernandes <ke...@gmail.com>.
Hi Roman,

Thanks for the quick response. There is no vertex data in this
dataset though, and the vertex IDs posted above would fit in a
Long. Would you advise changing the PageRankComputation
formats, or working on a new input format?

Thanks,
Kenrick


Re: Input format problems running Giraph 1.1.0 on Twitter dataset

Posted by Roman Shaposhnik <ro...@shaposhnik.org>.
One of the slightly annoying things in Giraph is that you have
to manually match your input format to your computation. In
your case, PageRankComputation requires LongWritable for
vertex ID and DoubleWritable for vertex Data. You may need
to hack one of the existing formats slightly.
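
The two errors reported in the original post are consistent with the vertex ID type being checked before the vertex value type. Below is a toy model of that check in plain Python. It is not Giraph's actual code: the function and dictionary names are hypothetical, the type parameters of the two input formats are assumed from their class names, and it uses simple equality where the real check tests assignability.

```python
def check_class_types(computation, input_format):
    """Toy model: compare the vertex ID type first, then the vertex value type."""
    for label, key in (("vertex index", "id"), ("vertex value", "value")):
        if computation[key] != input_format[key]:
            raise ValueError(
                f"checkClassTypes: {label} types not assignable, "
                f"computation - {computation[key]}, "
                f"VertexInputFormat - {input_format[key]}"
            )


# Types named in this thread; the format entries are assumptions.
PAGE_RANK = {"id": "LongWritable", "value": "DoubleWritable"}
INT_INT_NULL = {"id": "IntWritable", "value": "IntWritable"}
LONG_LONG_NULL = {"id": "LongWritable", "value": "LongWritable"}

for name, fmt in (("IntIntNull", INT_INT_NULL), ("LongLongNull", LONG_LONG_NULL)):
    try:
        check_class_types(PAGE_RANK, fmt)
    except ValueError as err:
        print(name, "->", err)
```

Under these assumptions, the Int format fails on the vertex ID, while the Long format passes that check and then fails on the vertex value, matching the order of the two messages.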


Thanks,
Roman.
