
Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2018/01/24 20:55:00 UTC

[jira] [Updated] (NUTCH-2369) Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph

     [ https://issues.apache.org/jira/browse/NUTCH-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-2369:
----------------------------------------
    Labels: gsoc2017 gsoc2018  (was: gsoc2017)

> Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph
> ------------------------------------------------------------------------------
>
>                 Key: NUTCH-2369
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2369
>             Project: Nutch
>          Issue Type: Task
>          Components: crawldb, graphgenerator, hostdb, linkdb, segment, storage, tool
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
>              Labels: gsoc2017, gsoc2018
>             Fix For: 1.15
>
>
> I've been thinking for quite some time now that a new Tool which writes Nutch data out as full graph data would be an excellent addition to the codebase.
> My thoughts involve writing data using TinkerPop's ScriptInputFormat and ScriptOutputFormat to create Vertex objects representing Nutch Crawl Records.
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptInputFormat.html
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptOutputFormat.html
> I envisage that each Vertex object would require the CrawlDB, the LinkDB, a Segment and possibly the HostDB in order to be fully populated. Graph characteristics, e.g. Edges, would come from those existing data structures as well.
> It is my intention to propose this as a GSoC project for 2017 and I have already talked offline with a potential student [~omkar20895] about him participating.
> Essentially, if we were able to create a Graph enabling true traversal, this could be a game changer for how Nutch Crawl data is interpreted. It is my feeling that this issue most likely also involves an entire upgrade of the Hadoop APIs from mapred to mapreduce for the master codebase.
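
As a rough illustration of the idea in the description, the sketch below combines CrawlDB-style status records with LinkDB-style inlink lists into vertices with outgoing edges. It is only a minimal sketch under stated assumptions: PageVertex, WebGraphSketch and build are hypothetical names, plain Java maps stand in for the real Nutch data structures, and no Nutch, Hadoop or TinkerPop classes are used. It relies on the LinkDB being keyed by target URL with its inlinks as the value, so each inlink becomes an edge from the source page to the target page.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified vertex; not a Nutch or TinkerPop class.
final class PageVertex {
    final String url;
    // Properties that would come from the CrawlDB (and possibly HostDB, Segment).
    final Map<String, String> properties = new LinkedHashMap<>();
    // URLs this page links to, derived from LinkDB-style inlink records.
    final List<String> outEdges = new ArrayList<>();

    PageVertex(String url) {
        this.url = url;
    }
}

// Hypothetical helper that merges the two record sets into one graph.
final class WebGraphSketch {

    // crawlStatus: URL -> status string (stand-in for a CrawlDB entry).
    // inlinks:     target URL -> source URLs linking to it (stand-in for a LinkDB entry).
    static Map<String, PageVertex> build(Map<String, String> crawlStatus,
                                         Map<String, List<String>> inlinks) {
        Map<String, PageVertex> graph = new LinkedHashMap<>();

        // One vertex per crawled URL, carrying its status as a property.
        for (Map.Entry<String, String> entry : crawlStatus.entrySet()) {
            vertex(graph, entry.getKey()).properties.put("status", entry.getValue());
        }

        // Invert each inlink record into an edge from source to target.
        for (Map.Entry<String, List<String>> entry : inlinks.entrySet()) {
            String target = entry.getKey();
            vertex(graph, target);
            for (String source : entry.getValue()) {
                vertex(graph, source).outEdges.add(target);
            }
        }
        return graph;
    }

    // Returns the vertex for a URL, creating it on first use.
    private static PageVertex vertex(Map<String, PageVertex> graph, String url) {
        return graph.computeIfAbsent(url, PageVertex::new);
    }
}
```

In an actual tool each vertex would presumably be serialized through ScriptOutputFormat rather than held in memory; the in-memory map here only shows how the existing records could supply both vertex properties and edges.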



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)