Posted to dev@hama.apache.org by "Thomas Jungblut (JIRA)" <ji...@apache.org> on 2012/05/22 22:59:41 UTC

[jira] [Created] (HAMA-580) Improve input of graph module

Thomas Jungblut created HAMA-580:
------------------------------------

             Summary: Improve input of graph module
                 Key: HAMA-580
                 URL: https://issues.apache.org/jira/browse/HAMA-580
             Project: Hama
          Issue Type: Improvement
          Components: graph
    Affects Versions: 0.5.0
            Reporter: Thomas Jungblut
            Assignee: Thomas Jungblut
             Fix For: 0.5.0


Currently the format is too verbose: the Wikipedia dataset is bloated from 0.95 GB to 5 GB just because the class names are written x times.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Re: [jira] [Commented] (HAMA-580) Improve input of graph module

Posted by Thomas Jungblut <th...@googlemail.com>.
Great that you share that view, so let's do it like this.

2012/5/24 Edward J. Yoon <ed...@apache.org>

> > However it might be a good thing to consider that giraph is supporting all
> > inputformats and have a input key/value to vertex parser that runs when
> > loading vertices.
> > This would shift the responsibility to the user and we would remove
> > Writability of the vertices, thus removing the VertexWritable classes.
>
> +1
>



-- 
Thomas Jungblut
Berlin <th...@gmail.com>

Re: [jira] [Commented] (HAMA-580) Improve input of graph module

Posted by "Edward J. Yoon" <ed...@apache.org>.
> However it might be a good thing to consider that giraph is supporting all
> inputformats and have a input key/value to vertex parser that runs when
> loading vertices.
> This would shift the responsibility to the user and we would remove
> Writability of the vertices, thus removing the VertexWritable classes.

+1




-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: [jira] [Commented] (HAMA-580) Improve input of graph module

Posted by Thomas Jungblut <th...@googlemail.com>.
I can't post to JIRA because it is down or has high latency.

I dislike the idea as well, but it is the most compact way to write the
vertices.
Consider the Wikipedia link set: 1 GB of text data as an adjacency list.
With the current trunk version it grows to as much as 10 GB.
I have not measured the size with that patch, but I assume that it
will be less than 1 GB.
Suppose you have a 64 MB chunk size in HDFS: that means 160 BSP tasks to be
launched, as opposed to 16 in the optimal case.
I don't know if that's an argument for you. Compatibility with MapReduce
shouldn't be our first aim; we can make a BSP job out of the random graph
generator.
However, it might be worth considering that Giraph supports all
input formats and has an input key/value-to-vertex parser that runs when
loading vertices.
This would shift the responsibility to the user and we would remove
Writability of the vertices, thus removing the VertexWritable classes.

If you have a good trade-off idea, let me know.
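
The task counts in this message follow from dividing the input size by the HDFS block size. A minimal sketch of that arithmetic, assuming one BSP task per 64 MB block (the class and method names are hypothetical, chosen only for illustration):

```java
/** Rough task-count estimate, assuming one BSP task per HDFS block. */
final class SplitEstimate {
  /** Number of blocks needed to hold inputBytes, rounding up. */
  static long tasksFor(long inputBytes, long blockBytes) {
    return (inputBytes + blockBytes - 1) / blockBytes;
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024L;
    long block = 64 * mb;
    System.out.println(tasksFor(10 * 1024 * mb, block)); // 10 GB input, prints 160
    System.out.println(tasksFor(1024 * mb, block));      // 1 GB input, prints 16
  }
}
```

The tenfold difference in file size translates directly into a tenfold difference in launched tasks under this assumption.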


2012/5/24 Edward J. Yoon (JIRA) <ji...@apache.org>

>
>    [
> https://issues.apache.org/jira/browse/HAMA-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282244#comment-13282244]
>
> Edward J. Yoon commented on HAMA-580:
> -------------------------------------
>
> I dislike this idea. This makes programming complex and discourages use of
> existing Mapper/Reducer e.g., Reducer, LongSumReducer, ...


-- 
Thomas Jungblut
Berlin <th...@gmail.com>

[jira] [Commented] (HAMA-580) Improve input of graph module

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282232#comment-13282232 ] 

Edward J. Yoon commented on HAMA-580:
-------------------------------------

So, should I implement my own Reducer to set them, instead of using Reducer.class?

In my opinion, VertexWritable should be easy to use in any program, e.g., M/R, user applications, ...
                

        

[jira] [Commented] (HAMA-580) Improve input of graph module

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282104#comment-13282104 ] 

Edward J. Yoon commented on HAMA-580:
-------------------------------------

As I mentioned yesterday, the code below will throw a ClassCastException.

{code}
  @Override
  public int compareTo(VertexWritable o) {
    return ((VertexWritable) this.vertexId).compareTo((VertexWritable) o
        .getVertexId());
  }
{code}
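
The cast fails because the vertex ID is not itself a vertex record, so casting it to the record type cannot succeed at runtime. The sketch below reproduces that with a hypothetical stand-in class (VertexRecord, not the Hama VertexWritable) and shows one possible fix, assuming the ID type is Comparable:

```java
/** Hypothetical stand-in for a vertex record; this is not the Hama class. */
final class VertexRecord<ID extends Comparable<ID>>
    implements Comparable<VertexRecord<ID>> {
  private final ID vertexId;

  VertexRecord(ID vertexId) {
    this.vertexId = vertexId;
  }

  ID getVertexId() {
    return vertexId;
  }

  /** Mirrors the reported pattern: the IDs are cast to the record type. */
  @SuppressWarnings("unchecked")
  int brokenCompareTo(VertexRecord<ID> o) {
    Object id = this.vertexId;
    Object otherId = o.getVertexId();
    return ((VertexRecord<ID>) id).compareTo((VertexRecord<ID>) otherId);
  }

  /** Compares the IDs themselves, which is what the ordering needs. */
  @Override
  public int compareTo(VertexRecord<ID> o) {
    return this.vertexId.compareTo(o.getVertexId());
  }

  public static void main(String[] args) {
    VertexRecord<String> a = new VertexRecord<>("a");
    VertexRecord<String> b = new VertexRecord<>("b");
    System.out.println(a.compareTo(b) < 0); // prints true
    try {
      a.brokenCompareTo(b);
    } catch (ClassCastException e) {
      System.out.println("ClassCastException, as reported");
    }
  }
}
```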

                

        

[jira] [Commented] (HAMA-580) Improve input of graph module

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282226#comment-13282226 ] 

Thomas Jungblut commented on HAMA-580:
--------------------------------------

Do you understand how it works?

{noformat}
    Configuration conf = new Configuration();
    VertexWritable.CONFIGURATION = conf;
    VertexArrayWritable.CONFIGURATION = conf;
    VertexWritable.VERTEX_ID_CLASS = Text.class;
    VertexWritable.VERTEX_VALUE_CLASS = IntWritable.class;
    VertexArrayWritable.EDGE_ID_CLASS = Text.class;
    VertexArrayWritable.EDGE_VALUE_CLASS = IntWritable.class;
{noformat}

bq. Should we use template again?
The large-file problem comes from the templates, mainly because you have to write the class names for every part of a vertex.
So the file contains the same class names thousands of times as UTF-8 strings. That is unnecessary, because the classes should be constant for each vertex and known on the client side as well as on the job side.

The static setting is not a very good solution, but it saves so much space compared to fields.
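
To get a feel for the overhead described here, the following sketch serializes the same records twice, once with a class name written before every field and once without. It uses only java.io and treats the Hadoop class names as plain strings; the record layout and the names ClassNameOverhead and serializedSize are assumptions for illustration, not the actual Hama file format.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Illustrative size comparison; the layout is an assumption, not Hama's format. */
final class ClassNameOverhead {
  // Class names used only as strings here, so no Hadoop dependency is needed.
  static final String ID_CLASS = "org.apache.hadoop.io.Text";
  static final String VALUE_CLASS = "org.apache.hadoop.io.IntWritable";

  /** Serializes n (id, value) records, optionally prefixing each field with its class name. */
  static int serializedSize(int n, boolean writeClassNames) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      for (int i = 0; i < n; i++) {
        if (writeClassNames) {
          out.writeUTF(ID_CLASS);
        }
        out.writeUTF("v" + i);
        if (writeClassNames) {
          out.writeUTF(VALUE_CLASS);
        }
        out.writeInt(i);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.size();
  }

  public static void main(String[] args) {
    System.out.println("with class names:    " + serializedSize(1000, true));
    System.out.println("without class names: " + serializedSize(1000, false));
  }
}
```

With short IDs and values, the repeated class names dominate the record size, which is consistent with the several-fold bloat reported in this issue.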
                

        

[jira] [Commented] (HAMA-580) Improve input of graph module

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282233#comment-13282233 ] 

Edward J. Yoon commented on HAMA-580:
-------------------------------------

Why not use constructor?
                

        

[jira] [Commented] (HAMA-580) Improve input of graph module

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282244#comment-13282244 ] 

Edward J. Yoon commented on HAMA-580:
-------------------------------------

I dislike this idea. This makes programming complex and discourages use of existing Mapper/Reducer e.g., Reducer, LongSumReducer, ...
                

        

[jira] [Commented] (HAMA-580) Improve input of graph module

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283360#comment-13283360 ] 

Edward J. Yoon commented on HAMA-580:
-------------------------------------

Oh, okay.
                

        

[jira] [Commented] (HAMA-580) Improve input of graph module

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282206#comment-13282206 ] 

Thomas Jungblut commented on HAMA-580:
--------------------------------------

Yes.
                

        

[jira] [Commented] (HAMA-580) Improve input of graph module

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282224#comment-13282224 ] 

Edward J. Yoon commented on HAMA-580:
-------------------------------------

I tried to generate a random graph using your MR job, but got:

{code}
12/05/24 15:07:38 INFO mapred.JobClient: Task Id : attempt_201205241504_0002_r_000000_2, Status : FAILED
java.lang.RuntimeException: java.lang.NullPointerException
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
        at org.apache.hama.graph.VertexWritable.readFields(VertexWritable.java:54)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:113)
        at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.NullPointerException
        at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:107)
        ... 13 more
{code}

Should we use template again?
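
The trace shows the failure inside ConcurrentHashMap.get, called from ReflectionUtils.newInstance, which is what happens when the class passed in is null. One plausible reading, which is an assumption and not confirmed in this thread, is that the statically configured vertex classes were never set in the reduce task's JVM. The sketch below imitates that pattern with standard-library classes only; StaticClassConfig and its members are hypothetical and are not Hadoop or Hama code.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical imitation of the failure pattern; not Hadoop or Hama code. */
final class StaticClassConfig {
  /** Meant to be set before any record is read; null in a JVM that never sets it. */
  static Class<?> VERTEX_ID_CLASS;

  private static final ConcurrentHashMap<Class<?>, Boolean> SEEN =
      new ConcurrentHashMap<>();

  /** Looks the class up in a concurrent cache, then instantiates it reflectively. */
  static Object newInstance(Class<?> cls) {
    try {
      SEEN.get(cls); // ConcurrentHashMap.get(null) throws NullPointerException
      Object instance = cls.getDeclaredConstructor().newInstance();
      SEEN.put(cls, Boolean.TRUE);
      return instance;
    } catch (Exception e) {
      throw new RuntimeException(e); // same kind of wrapping as in the trace
    }
  }

  public static void main(String[] args) {
    try {
      newInstance(VERTEX_ID_CLASS); // the static field was never set
    } catch (RuntimeException e) {
      System.out.println("failed: " + e.getCause());
    }
    VERTEX_ID_CLASS = StringBuilder.class; // stand-in for a Writable class
    System.out.println(newInstance(VERTEX_ID_CLASS).getClass().getName());
  }
}
```

Static fields live per JVM, so a value assigned in the submitting client is not visible in a separately launched task unless it is assigned there as well.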
                

        

[jira] [Commented] (HAMA-580) Improve input of graph module

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281411#comment-13281411 ] 

Thomas Jungblut commented on HAMA-580:
--------------------------------------

Great to hear, hopefully this issue will not destroy your cluster ;)

Basically, we can simply write the ID and VALUE of a vertex directly, but this requires that both the client and the task have the classes available for deserialization.
For example, for PageRank:
Text/DoubleWritable (the value can be null, though) for the vertex serialization, and for the ArrayWritable it's just Text/NullWritable.
VertexArrayWritable is very verbose as well, and I can rewrite it.
I would solve this the same way GraphJobRunner deals with the vertex messages: by setting the classes statically and then just reading them back.

Sorry for that problem, but I just wanted to make it run when coding the generics for the graph package.
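
As a rough illustration of writing ID and VALUE directly, the sketch below writes a PageRank-style record with no type information in the stream and reads it back using a layout that writer and reader both know in advance. It uses java.io instead of Hadoop Writables; the class DirectVertexIo and the exact field order are assumptions, not the format in the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative fixed-layout record I/O; not the format used by the patch. */
final class DirectVertexIo {
  /** Writes id, value, edge count and edge ids with no class names in the stream. */
  static byte[] write(String id, double value, List<String> edges) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeUTF(id);
      out.writeDouble(value);
      out.writeInt(edges.size());
      for (String edge : edges) {
        out.writeUTF(edge);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Reads a record back; works only because the reader knows the same layout. */
  static String readBack(byte[] data) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      String id = in.readUTF();
      double value = in.readDouble();
      int count = in.readInt();
      List<String> edges = new ArrayList<>();
      for (int i = 0; i < count; i++) {
        edges.add(in.readUTF());
      }
      return id + " " + value + " " + edges;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] record = write("a", 0.25, List.of("b", "c"));
    System.out.println(readBack(record)); // prints a 0.25 [b, c]
  }
}
```

The trade-off is the one discussed in this thread: the stream is compact, but a reader with a different idea of the types fails or reads garbage.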
                

        

[jira] [Commented] (HAMA-580) Improve input of graph module

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284011#comment-13284011 ] 

Thomas Jungblut commented on HAMA-580:
--------------------------------------

Committed now. Will update http://wiki.apache.org/hama/WriteHamaGraphFile soon with it.
                

        

[jira] [Updated] (HAMA-580) Improve input of graph module

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Jungblut updated HAMA-580:
---------------------------------

    Attachment: HAMA-580.patch

I will fix that MR job once this is committed to trunk. Then you shouldn't get these EOF exceptions.

The test cases are fine now.

{noformat}
[INFO] --- maven-javadoc-plugin:2.6:aggregate-jar (default) @ hama-dist ---
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Hama parent POM ............................ SUCCESS [21.661s]
[INFO] core .............................................. SUCCESS [7:24.543s]
[INFO] graph ............................................. SUCCESS [2.694s]
[INFO] examples .......................................... SUCCESS [1:00.114s]
[INFO] yarn .............................................. SUCCESS [3.492s]
[INFO] hama-dist ......................................... SUCCESS [29.964s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 9:22.994s
[INFO] Finished at: Wed May 23 20:13:33 CEST 2012
[INFO] Final Memory: 89M/974M

{noformat}
                

        

[jira] [Resolved] (HAMA-580) Improve input of graph module

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Jungblut resolved HAMA-580.
----------------------------------

    Resolution: Fixed
    

        

[jira] [Commented] (HAMA-580) Improve input of graph module

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283359#comment-13283359 ] 

Thomas Jungblut commented on HAMA-580:
--------------------------------------

As I mentioned elsewhere, you're writing the file in the wrong layout.

The fixed byte layout for SSSP must be like this:
VertexID, VertexVALUE -> n times the same.
Text, IntWritable, VertexArrayWritable<Text, IntWritable>

Either fix the MapReduce job accordingly to take the classes, or use PageRank, whose layout is:
VertexID, VertexVALUE -> n times quite the same
Text, DoubleWritable -> VertexArrayWritable<Text, NullWritable>

I'm taking a few days off from Hama; I'll write the proposed input formatter for the user on Sunday/Monday.
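
The proposed direction, a user-supplied parser that turns each input record into a vertex while loading, might look roughly like the sketch below. It assumes a tab-separated adjacency-list line of the form "id, then edge targets"; AdjacencyLineParser and ParsedVertex are hypothetical names and this is not the VertexInputReader API that was eventually committed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical user-side parser: one text line in, one vertex out. */
final class AdjacencyLineParser {
  /** Minimal parsed vertex: an ID plus the IDs of its outgoing edges. */
  static final class ParsedVertex {
    final String id;
    final List<String> edges;

    ParsedVertex(String id, List<String> edges) {
      this.id = id;
      this.edges = edges;
    }
  }

  /** Parses "id<TAB>target1<TAB>target2..." (the separator is an assumption). */
  static ParsedVertex parse(String line) {
    String[] parts = line.split("\t");
    List<String> edges = new ArrayList<>();
    for (int i = 1; i < parts.length; i++) {
      edges.add(parts[i]);
    }
    return new ParsedVertex(parts[0], edges);
  }

  public static void main(String[] args) {
    ParsedVertex v = parse("a\tb\tc");
    System.out.println(v.id + " -> " + v.edges); // prints a -> [b, c]
  }
}
```

With a parser like this the input can stay in its original text form, so no vertex-specific binary file has to be written first.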
                

        

[jira] [Commented] (HAMA-580) Improve input of graph module

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284088#comment-13284088 ] 

Hudson commented on HAMA-580:
-----------------------------

Integrated in Hama-Nightly #559 (See [https://builds.apache.org/job/Hama-Nightly/559/])
    [HAMA-580]: Improve input of graph module (Revision 1342922)

     Result = SUCCESS
tjungblut : 
Files : 
* /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/TextInputFormat.java
* /incubator/hama/trunk/examples/src/main/java/org/apache/hama/examples/ExampleDriver.java
* /incubator/hama/trunk/examples/src/main/java/org/apache/hama/examples/InlinkCount.java
* /incubator/hama/trunk/examples/src/main/java/org/apache/hama/examples/MindistSearch.java
* /incubator/hama/trunk/examples/src/main/java/org/apache/hama/examples/PageRank.java
* /incubator/hama/trunk/examples/src/main/java/org/apache/hama/examples/SSSP.java
* /incubator/hama/trunk/examples/src/main/java/org/apache/hama/examples/util
* /incubator/hama/trunk/examples/src/test/java/org/apache/hama/examples/MindistSearchTest.java
* /incubator/hama/trunk/examples/src/test/java/org/apache/hama/examples/PageRankTest.java
* /incubator/hama/trunk/examples/src/test/java/org/apache/hama/examples/SSSPTest.java
* /incubator/hama/trunk/examples/src/test/java/org/apache/hama/examples/util
* /incubator/hama/trunk/graph/src/main/java/org/apache/hama/graph/Edge.java
* /incubator/hama/trunk/graph/src/main/java/org/apache/hama/graph/GraphJob.java
* /incubator/hama/trunk/graph/src/main/java/org/apache/hama/graph/GraphJobMessage.java
* /incubator/hama/trunk/graph/src/main/java/org/apache/hama/graph/GraphJobRunner.java
* /incubator/hama/trunk/graph/src/main/java/org/apache/hama/graph/Vertex.java
* /incubator/hama/trunk/graph/src/main/java/org/apache/hama/graph/VertexArrayWritable.java
* /incubator/hama/trunk/graph/src/main/java/org/apache/hama/graph/VertexInputReader.java
* /incubator/hama/trunk/graph/src/main/java/org/apache/hama/graph/VertexInterface.java
* /incubator/hama/trunk/graph/src/main/java/org/apache/hama/graph/VertexWritable.java
* /incubator/hama/trunk/graph/src/test/java/org/apache/hama/graph/TestSubmitGraphJob.java
* /incubator/hama/trunk/graph/src/test/java/org/apache/hama/graph/example/PageRank.java

                
> Improve input of graph module
> -----------------------------
>
>                 Key: HAMA-580
>                 URL: https://issues.apache.org/jira/browse/HAMA-580
>             Project: Hama
>          Issue Type: Improvement
>          Components: graph
>    Affects Versions: 0.5.0
>            Reporter: Thomas Jungblut
>            Assignee: Thomas Jungblut
>             Fix For: 0.5.0
>
>         Attachments: HAMA-580.patch, HAMA-580_1.patch, HAMA-580_2.patch
>
>
> Currently it is too verbose, the wikipedia dataset is going to be bloated from 0.95gb to 5gb just because it is writing the classes x-times.


        

[jira] [Updated] (HAMA-580) Improve input of graph module

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Jungblut updated HAMA-580:
---------------------------------

    Attachment: HAMA-580_2.patch

Wow, it is so much cleaner now :D

I've rewritten the test cases to use text files, which may be better for users working in a shell and playing around.

I have also added a vertex input reader that you can extend.

BTW, I have removed the seq2text utils and the vertex writable classes. These deletions are simply not included in the patch.

Please review this; the test cases are fine.
@Edward, the MapReduce job must now be rewritten to output Text in a specific format. You can find the format documented in each example's reader.
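
As a rough idea of what such a reader has to do, here is a minimal sketch in plain Java. It does not use Hama's reader API, and the line format is an assumption made for illustration: a tab-separated adjacency list with the vertex ID first and its neighbours after it. The real format for each example is whatever that example's reader documents; AdjacencyLineParser and ParsedVertex are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only: not part of Hama.
final class AdjacencyLineParser {

  /** Result of parsing one text line. */
  static final class ParsedVertex {
    final String id;
    final List<String> neighbours;

    ParsedVertex(String id, List<String> neighbours) {
      this.id = id;
      this.neighbours = neighbours;
    }
  }

  /**
   * Parses one line of the assumed format "id<TAB>neighbour1<TAB>neighbour2...".
   * A line with only an ID yields a vertex without neighbours.
   */
  static ParsedVertex parse(String line) {
    String[] fields = line.split("\t");
    if (fields.length == 0 || fields[0].isEmpty()) {
      throw new IllegalArgumentException("line has no vertex ID: '" + line + "'");
    }
    // Everything after the first field is treated as a neighbour ID.
    List<String> neighbours =
        new ArrayList<>(Arrays.asList(fields).subList(1, fields.length));
    return new ParsedVertex(fields[0], Collections.unmodifiableList(neighbours));
  }

  public static void main(String[] args) {
    ParsedVertex vertex = parse("A\tB\tC");
    System.out.println(vertex.id + " -> " + vertex.neighbours);
  }
}
```

Keeping the parsing on the user's side like this means the input stays as compact text and no class names have to be stored with every record.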

Should we add something equivalent for the output? That may be better than just outputting the ID and value of each vertex.
                
> Improve input of graph module
> -----------------------------
>
>                 Key: HAMA-580
>                 URL: https://issues.apache.org/jira/browse/HAMA-580
>             Project: Hama
>          Issue Type: Improvement
>          Components: graph
>    Affects Versions: 0.5.0
>            Reporter: Thomas Jungblut
>            Assignee: Thomas Jungblut
>             Fix For: 0.5.0
>
>         Attachments: HAMA-580.patch, HAMA-580_1.patch, HAMA-580_2.patch
>
>
> Currently it is too verbose, the wikipedia dataset is going to be bloated from 0.95gb to 5gb just because it is writing the classes x-times.


        

[jira] [Commented] (HAMA-580) Improve input of graph module

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281325#comment-13281325 ] 

Edward J. Yoon commented on HAMA-580:
-------------------------------------

Great improvements!

I'll check TRUNK (with the changes you made) on my fully distributed cluster and give you feedback today.
                
> Improve input of graph module
> -----------------------------
>
>                 Key: HAMA-580
>                 URL: https://issues.apache.org/jira/browse/HAMA-580
>             Project: Hama
>          Issue Type: Improvement
>          Components: graph
>    Affects Versions: 0.5.0
>            Reporter: Thomas Jungblut
>            Assignee: Thomas Jungblut
>             Fix For: 0.5.0
>
>
> Currently it is too verbose, the wikipedia dataset is going to be bloated from 0.95gb to 5gb just because it is writing the classes x-times.


        

[jira] [Commented] (HAMA-580) Improve input of graph module

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283587#comment-13283587 ] 

Thomas Jungblut commented on HAMA-580:
--------------------------------------

But scratch that, I'll give you a patch now.
                
> Improve input of graph module
> -----------------------------
>
>                 Key: HAMA-580
>                 URL: https://issues.apache.org/jira/browse/HAMA-580
>             Project: Hama
>          Issue Type: Improvement
>          Components: graph
>    Affects Versions: 0.5.0
>            Reporter: Thomas Jungblut
>            Assignee: Thomas Jungblut
>             Fix For: 0.5.0
>
>         Attachments: HAMA-580.patch, HAMA-580_1.patch
>
>
> Currently it is too verbose, the wikipedia dataset is going to be bloated from 0.95gb to 5gb just because it is writing the classes x-times.


        

[jira] [Updated] (HAMA-580) Improve input of graph module

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Jungblut updated HAMA-580:
---------------------------------

    Attachment: HAMA-580_1.patch

Fixed that comparable. Will fix the MR job then.
                
> Improve input of graph module
> -----------------------------
>
>                 Key: HAMA-580
>                 URL: https://issues.apache.org/jira/browse/HAMA-580
>             Project: Hama
>          Issue Type: Improvement
>          Components: graph
>    Affects Versions: 0.5.0
>            Reporter: Thomas Jungblut
>            Assignee: Thomas Jungblut
>             Fix For: 0.5.0
>
>         Attachments: HAMA-580.patch, HAMA-580_1.patch
>
>
> Currently it is too verbose, the wikipedia dataset is going to be bloated from 0.95gb to 5gb just because it is writing the classes x-times.


        

[jira] [Commented] (HAMA-580) Improve input of graph module

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283351#comment-13283351 ] 

Edward J. Yoon commented on HAMA-580:
-------------------------------------

Could you please fix only the EOF exception problem for now?
                
> Improve input of graph module
> -----------------------------
>
>                 Key: HAMA-580
>                 URL: https://issues.apache.org/jira/browse/HAMA-580
>             Project: Hama
>          Issue Type: Improvement
>          Components: graph
>    Affects Versions: 0.5.0
>            Reporter: Thomas Jungblut
>            Assignee: Thomas Jungblut
>             Fix For: 0.5.0
>
>         Attachments: HAMA-580.patch, HAMA-580_1.patch
>
>
> Currently it is too verbose, the wikipedia dataset is going to be bloated from 0.95gb to 5gb just because it is writing the classes x-times.
