Posted to dev@giraph.apache.org by Paolo Castagna <ca...@googlemail.com> on 2012/04/05 06:45:48 UTC

On helping new contributors pitch in quickly...

"""
To help new contributors pitch in quickly, we maintain a set of JIRAs [1] that
focus on getting new contributors started with the mechanics of generating a
patch — downloading the source, changing a couple lines, creating a patch,
verifying its correctness, uploading it to JIRA and working with the community —
rather than deep technical issues within Giraph itself. These are good issues
with which to join the community.
"""

This is nice, good idea indeed.

Put more issues there (even if, at the moment, there does not seem to be much
"simple" stuff around that would get people started). Things such as "port Giraph
to YARN" or a new RPC layer are a bit scary for those just starting (like me). :-)

Perhaps another option is to increase the number of examples. You already have a
few interesting ones; do you have one or two ideas for examples which
could be added to Giraph?

Paolo

 [1] http://bit.ly/newbie_apache_giraph_issues

Re: On helping new contributors pitch in quickly...

Posted by Avery Ching <ac...@apache.org>.

Here is a related JIRA https://issues.apache.org/jira/browse/GIRAPH-155

Avery

On 4/5/12 9:45 AM, Paolo Castagna wrote:
> Hi Dan,
> I don't have an answer to your questions/observations yet.
>
> However, I suspect N-Triples | N-Quads might not be the best option for
> something like Giraph. Something more like an adjacency list might be
> better.
>
> So my intuition is that if you start with RDF in N-Triples format,
> the first step would be a simple MapReduce job to group RDF statements
> by subject (possibly filtering out certain properties):
>
> Input:
>
>    s1 --p1-->  o1
>    s1 --p2-->  o2
>    s1 --p2-->  o3
>    s2 ...
>
> Output (adjacency list):
>
>    s1 (p1 o1) (p2 o2) (p2 o3)
>    s2 ...
>
> But, as I said, it is too early for me to say definitively that this is the
> best approach.
>
> Paolo
>
> Dan Brickley wrote:
>> On 5 April 2012 05:49, Jakob Homan <jg...@gmail.com> wrote:
>>> Ack!, I suck.  Sorry.  I hadn't realized we'd gone through most of
>>> them, which itself is a good thing.  I'll get some new ones added
>>> first thing in the morning.  Sorry.
>> Do we have something around "document a workflow to get RDF graph data
>> into Giraph?". A few of us have been talking about it here or there,
>> and I've heard various strategies mentioned (e.g. Ntriples as it's a
>> simple line-oriented format; piggybacking on HBase or other storage
>> that Giraph already has adaptors for; integrating Apache Jena; ...). I
>> can't find much in JIRA but
>> https://issues.apache.org/jira/browse/GIRAPH-141 touches on the issue
>> (since we can't currently easily represent fully general RDF graphs
>> since two nodes might be connected by more than one typed edge). Even
>> without multigraphs it ought to be possible to bring RDF-sourced data
>> into Giraph, e.g. perhaps some app is only interested in say the
>> Movies + People subset of a big RDF collection. And so perhaps most of
>> the work is in preprocessing for now - e.g. via Ntriples + Pig; but
>> still it would be great to have a clear HOWTO.
>>
>> As an interested party on the periphery, a JIRA for this would give a
>> natural place to monitor, read up, maybe even help. And I'm sure I'm
>> not alone...
>>
>> cheers,
>>
>> Dan

Re: On helping new contributors pitch in quickly...

Posted by Paolo Castagna <ca...@googlemail.com>.

Hi Dan,
I don't have an answer to your questions/observations yet.

However, I suspect N-Triples | N-Quads might not be the best option for
something like Giraph. Something more like an adjacency list might be
better.

So my intuition is that if you start with RDF in N-Triples format,
the first step would be a simple MapReduce job to group RDF statements
by subject (possibly filtering out certain properties):

Input:

  s1 --p1--> o1
  s1 --p2--> o2
  s1 --p2--> o3
  s2 ...

Output (adjacency list):

  s1 (p1 o1) (p2 o2) (p2 o3)
  s2 ...

But, as I said, it is too early for me to say definitively that this is the
best approach.
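
A rough sketch of that grouping step, in plain Python rather than as an
actual MapReduce job. Everything here is an assumption made for illustration:
the function names (map_triple, group_by_subject, format_adjacency), the
skip_predicates parameter, and the very simplified line parsing, which is not
a full N-Triples parser and assumes each statement has three
whitespace-separated parts followed by " .".

```python
from collections import defaultdict


def map_triple(line):
    """Map step: turn one N-Triples-like line into (subject, (predicate, object)).

    Returns None for blank lines and comments.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Simplified parsing (assumption): subject and predicate contain no
    # whitespace; everything after them is the object plus the final " .".
    subject, predicate, rest = line.split(None, 2)
    obj = rest.rstrip()
    if obj.endswith("."):
        obj = obj[:-1].rstrip()
    return subject, (predicate, obj)


def group_by_subject(lines, skip_predicates=frozenset()):
    """Reduce step: collect every (predicate, object) pair under its subject.

    Predicates listed in skip_predicates are filtered out.
    """
    adjacency = defaultdict(list)
    for line in lines:
        pair = map_triple(line)
        if pair is None:
            continue
        subject, (predicate, obj) = pair
        if predicate in skip_predicates:
            continue
        adjacency[subject].append((predicate, obj))
    return dict(adjacency)


def format_adjacency(adjacency):
    """Render one text line per subject: s1 (p1 o1) (p2 o2) ..."""
    return [
        subject + " " + " ".join("(%s %s)" % edge for edge in edges)
        for subject, edges in adjacency.items()
    ]


if __name__ == "__main__":
    triples = [
        "<s1> <p1> <o1> .",
        "<s1> <p2> <o2> .",
        "<s1> <p2> <o3> .",
        "<s2> <p1> <s1> .",
    ]
    for row in format_adjacency(group_by_subject(triples)):
        print(row)
```

In a real job the map function would run per input split and the framework
would do the grouping by key; the in-memory dictionary above only stands in
for that shuffle.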

Paolo

Dan Brickley wrote:
> On 5 April 2012 05:49, Jakob Homan <jg...@gmail.com> wrote:
>> Ack!, I suck.  Sorry.  I hadn't realized we'd gone through most of
>> them, which itself is a good thing.  I'll get some new ones added
>> first thing in the morning.  Sorry.
> 
> Do we have something around "document a workflow to get RDF graph data
> into Giraph?". A few of us have been talking about it here or there,
> and I've heard various strategies mentioned (e.g. Ntriples as it's a
> simple line-oriented format; piggybacking on HBase or other storage
> that Giraph already has adaptors for; integrating Apache Jena; ...). I
> can't find much in JIRA but
> https://issues.apache.org/jira/browse/GIRAPH-141 touches on the issue
> (since we can't currently easily represent fully general RDF graphs
> since two nodes might be connected by more than one typed edge). Even
> without multigraphs it ought to be possible to bring RDF-sourced data
> into Giraph, e.g. perhaps some app is only interested in say the
> Movies + People subset of a big RDF collection. And so perhaps most of
> the work is in preprocessing for now - e.g. via Ntriples + Pig; but
> still it would be great to have a clear HOWTO.
> 
> As an interested party on the periphery, a JIRA for this would give a
> natural place to monitor, read up, maybe even help. And I'm sure I'm
> not alone...
> 
> cheers,
> 
> Dan

Re: On helping new contributors pitch in quickly...

Posted by Jakob Homan <jg...@gmail.com>.

Sorry, took a couple days to get some time, but have now created 8 new
newbie JIRAs.  This should be enough for our new contributors to each
do a couple to get the hang of contributing to Giraph.  Thanks
Paolo for the reminder!
-Jakob


On Thu, Apr 5, 2012 at 11:43 AM, Dan Brickley <da...@danbri.org> wrote:
> On 5 April 2012 17:05, Avery Ching <ac...@apache.org> wrote:
>> Dan, you're definitely right that this has been mentioned a few times.  The
>> multigraph issue is one part of it, but a helper VertexInputFormat (and
>> maybe VertexOutputFormat) would certainly still help as you mention.  Can
>> you please open a JIRA (and help if you have time)?
>
> Here you go: https://issues.apache.org/jira/browse/GIRAPH-170
>
> I've tried to summarise discussion from here and elsewhere.
>
> Dan

Re: On helping new contributors pitch in quickly...

Posted by Dan Brickley <da...@danbri.org>.

On 5 April 2012 17:05, Avery Ching <ac...@apache.org> wrote:
> Dan, you're definitely right that this has been mentioned a few times.  The
> multigraph issue is one part of it, but a helper VertexInputFormat (and
> maybe VertexOutputFormat) would certainly still help as you mention.  Can
> you please open a JIRA (and help if you have time)?

Here you go: https://issues.apache.org/jira/browse/GIRAPH-170

I've tried to summarise discussion from here and elsewhere.

Dan

Re: On helping new contributors pitch in quickly...

Posted by Avery Ching <ac...@apache.org>.

Dan, you're definitely right that this has been mentioned a few times.  
The multigraph issue is one part of it, but a helper VertexInputFormat 
(and maybe VertexOutputFormat) would certainly still help as you 
mention.  Can you please open a JIRA (and help if you have time)?

Avery

On 4/5/12 1:49 AM, Dan Brickley wrote:
> On 5 April 2012 05:49, Jakob Homan <jg...@gmail.com> wrote:
>> Ack!, I suck.  Sorry.  I hadn't realized we'd gone through most of
>> them, which itself is a good thing.  I'll get some new ones added
>> first thing in the morning.  Sorry.
> Do we have something around "document a workflow to get RDF graph data
> into Giraph?". A few of us have been talking about it here or there,
> and I've heard various strategies mentioned (e.g. Ntriples as it's a
> simple line-oriented format; piggybacking on HBase or other storage
> that Giraph already has adaptors for; integrating Apache Jena; ...). I
> can't find much in JIRA but
> https://issues.apache.org/jira/browse/GIRAPH-141 touches on the issue
> (since we can't currently easily represent fully general RDF graphs
> since two nodes might be connected by more than one typed edge). Even
> without multigraphs it ought to be possible to bring RDF-sourced data
> into Giraph, e.g. perhaps some app is only interested in say the
> Movies + People subset of a big RDF collection. And so perhaps most of
> the work is in preprocessing for now - e.g. via Ntriples + Pig; but
> still it would be great to have a clear HOWTO.
>
> As an interested party on the periphery, a JIRA for this would give a
> natural place to monitor, read up, maybe even help. And I'm sure I'm
> not alone...
>
> cheers,
>
> Dan

Re: On helping new contributors pitch in quickly...

Posted by Dan Brickley <da...@danbri.org>.

On 5 April 2012 05:49, Jakob Homan <jg...@gmail.com> wrote:
> Ack!, I suck.  Sorry.  I hadn't realized we'd gone through most of
> them, which itself is a good thing.  I'll get some new ones added
> first thing in the morning.  Sorry.

Do we have something around "document a workflow to get RDF graph data
into Giraph?". A few of us have been talking about it here or there,
and I've heard various strategies mentioned (e.g. Ntriples as it's a
simple line-oriented format; piggybacking on HBase or other storage
that Giraph already has adaptors for; integrating Apache Jena; ...). I
can't find much in JIRA but
https://issues.apache.org/jira/browse/GIRAPH-141 touches on the issue
(since we can't currently easily represent fully general RDF graphs
since two nodes might be connected by more than one typed edge). Even
without multigraphs it ought to be possible to bring RDF-sourced data
into Giraph, e.g. perhaps some app is only interested in say the
Movies + People subset of a big RDF collection. And so perhaps most of
the work is in preprocessing for now - e.g. via Ntriples + Pig; but
still it would be great to have a clear HOWTO.
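
To make the multigraph point concrete, here is a small Python sketch. It
assumes a representation that keeps at most one edge value per target vertex,
which is only one reading of the limitation above and not a description of
Giraph's actual data structures. The function names and the sample
predicates (directed, actedIn) are made up for illustration.

```python
def to_simple_edges(pairs):
    """Keep one edge value per target, as a simple (non-multi) graph would.

    pairs is a list of (predicate, target) tuples for a single source vertex.
    Returns the kept edges and the list of pairs that could not be stored.
    """
    edges = {}
    dropped = []
    for predicate, target in pairs:
        if target in edges:
            # A second typed edge to the same target has nowhere to go.
            dropped.append((predicate, target))
        else:
            edges[target] = predicate
    return edges, dropped


def subset(pairs, wanted_predicates):
    """Preprocessing step: keep only the predicates an application cares about."""
    return [(p, o) for p, o in pairs if p in wanted_predicates]


if __name__ == "__main__":
    # One person vertex with two differently typed edges to the same film.
    person = [("directed", "film1"), ("actedIn", "film1"), ("actedIn", "film2")]

    print(to_simple_edges(person))
    print(to_simple_edges(subset(person, {"actedIn"})))
```

With the full edge list one of the two edges to film1 is lost, while after
restricting to a single predicate every remaining edge fits, which is why
preprocessing down to a subset can sidestep the multigraph problem.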

As an interested party on the periphery, a JIRA for this would give a
natural place to monitor, read up, maybe even help. And I'm sure I'm
not alone...

cheers,

Dan

Re: On helping new contributors pitch in quickly...

Posted by Jakob Homan <jg...@gmail.com>.

Ack!, I suck.  Sorry.  I hadn't realized we'd gone through most of
them, which itself is a good thing.  I'll get some new ones added
first thing in the morning.  Sorry.
-Jakob


On Wed, Apr 4, 2012 at 9:45 PM, Paolo Castagna
<ca...@googlemail.com> wrote:
> """
> To help new contributors pitch in quickly, we maintain a set of JIRAs [1] that
> focus on getting new contributors started with the mechanics of generating a
> patch — downloading the source, changing a couple lines, creating a patch,
> verifying its correctness, uploading it to JIRA and working with the community —
> rather than deep technical issues within Giraph itself. These are good issues
> with which to join the community.
> """
>
> This is nice, good idea indeed.
>
> Put more issues there (even if, at the moment, there does not seem to be much
> "simple" stuff around that would get people started). Things such as "port Giraph
> to YARN" or a new RPC layer are a bit scary for those just starting (like me). :-)
>
> Perhaps another option is to increase the number of examples. You already have a
> few interesting ones; do you have one or two ideas for examples which
> could be added to Giraph?
>
> Paolo
>
>  [1] http://bit.ly/newbie_apache_giraph_issues