Posted to dev@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2015/11/14 11:28:17 UTC

Upgrade of mapred --> mapreduce in trunk e.g. Nutch 3.X

Hi Folks,

Mike Joyce and I have been working on a TinkerPop implementation of Node
and NodeDB (generated through WebGraph) which builds a Vertex input that is
used by TinkerPop, and subsequently Gremlin, and persisted into a graph
database such as TitanDB.
We have analyzed the problem quite a bit and came across the following I/O
format:
http://tinkerpop.incubator.apache.org/docs/3.0.1-incubating/#script-io-format
I've implemented a PropertyWebGraphVertex writable in Nutch which builds
off of NodeDB (and others) to enable us to write out to
the ScriptOutputFormat. Essentially we address the issue of parent-child
vs. child-parent relationships, i.e. Outlinks vs. Inlinks respectively.
The work from there consists of an external process (to Nutch)
invoking a Groovy script from within Gremlin to ingest data into TitanDB.
During the course of this work we have realized that having both the mapred
and mapreduce APIs within trunk is NOT OK if we want Nutch to accommodate
the architecture described above.
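
To make the Outlinks vs. Inlinks distinction concrete, here is a minimal
sketch of writing one text line per vertex with the two edge directions
kept apart. It is not the actual PropertyWebGraphVertex code: the class
name, the method name and the tab/comma line layout are all assumptions
made for illustration. As far as I understand the script I/O format, the
line layout is whatever the accompanying Groovy script expects, so any
real layout would have to match that script.

```java
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical sketch: one text line per vertex, out- and in-edges kept apart. */
public final class VertexLineSketch {

    private VertexLineSketch() {}

    /**
     * Assumed layout (illustration only):
     * url TAB comma-separated outlink targets TAB comma-separated inlink sources.
     * Real code would need to escape tabs and commas that occur inside URLs.
     */
    public static String toLine(String url, List<String> outlinks, List<String> inlinks) {
        StringJoiner line = new StringJoiner("\t");
        line.add(url);
        line.add(String.join(",", outlinks)); // parent -> child edges (Outlinks)
        line.add(String.join(",", inlinks));  // child -> parent edges (Inlinks)
        return line.toString();
    }

    public static void main(String[] args) {
        System.out.println(toLine(
            "http://a.example/",
            List.of("http://b.example/", "http://c.example/"),
            List.of("http://d.example/")));
    }
}
```

Keeping both directions on the vertex line means the ingest script can
create edges either way without a second pass over the data.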

Breath of fresh air and a deep breath...

What do you guys think about branching trunk into a 3.X branch with every
mapred --> mapreduce package addressed?
Mike, Sujen and I talked today. We want to touch base with everyone
on dev@, as this ties in closely with the work undertaken in
https://issues.apache.org/jira/browse/NUTCH-2097

It does not, however, totally rearrange the codebase. It will, however,
generate a genuine graph output based upon
http://tinkerpop.incubator.apache.org/docs/3.0.1-incubating/#script-io-format
We can have a Gremlin script as part of $NUTCH_HOME/conf which merely
ingests data (along with a config file) into a graph database such as Titan.

What does everyone think?
Thanks
Lewis

-- 
*Lewis*

Re: Upgrade of mapred --> mapreduce in trunk e.g. Nutch 3.X

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi,

Yes, let's discuss this, maybe off-list,
to get to the point quickly.

I'll also be happy to help (if I can).

Sebastian

On 11/18/2015 01:01 AM, Michael Joyce wrote:
> I would be happy to help with this (although that's probably obvious from the above email), but I
> realize we'll probably want to chat a bit about this. It's certainly not a small change =)
> 
> 
> -- Jimmy
> 

Re: Upgrade of mapred --> mapreduce in trunk e.g. Nutch 3.X

Posted by Michael Joyce <jo...@apache.org>.

I would be happy to help with this (although that's probably obvious from
the above email), but I realize we'll probably want to chat a bit about
this. It's certainly not a small change =)


-- Jimmy
