Posted to dev@nutch.apache.org by "Lewis John McGibbney (Jira)" <ji...@apache.org> on 2021/11/28 14:28:00 UTC

[jira] [Comment Edited] (NUTCH-2838) Apache Tez integration

    [ https://issues.apache.org/jira/browse/NUTCH-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450043#comment-17450043 ] 

Lewis John McGibbney edited comment on NUTCH-2838 at 11/28/21, 2:27 PM:
------------------------------------------------------------------------

Hi [~abstractdog], thanks for commenting. For me at least, this is definitely on the Nutch roadmap. I did some initial experimentation, which I documented at https://cwiki.apache.org/confluence/display/NUTCH/Running+Nutch+on+Tez
At that time I ran into issues with the code implementation because I was trying to have as little impact on the Nutch codebase as possible. That is to say, I was trying to avoid an entire re-write of all existing (~18) Nutch MapReduce jobs. 
That being said, once I finish up my current work on documenting [Nutch metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics], I will come back to this issue. 
One of the things I would _*like to do*_ is actually provide documentation for the Tez community to see how we went about migrating from MR to Tez... so I will continue to document things as they progress.
As of writing I don't want to take on too much work. I will come back to this task once my current work is finished.
As a wild request, I wonder if you would be interested in looking at the [Nutch Injector|https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Injector.java] (basically NUTCH-2839), which is typically the first MR job in a Nutch crawl cycle. If you were able to provide some wisdom as to how you would evolve that MR job --> Tez, it would be great to observe your engineering methodology. No problem if this is not possible; I thought I would ask as a stretch :)


was (Author: lewismc):
Hi [~abstractdog], thanks for commenting. For me at least, this is definitely on the Nutch roadmap. I did some initial experimentation, which I documented at https://cwiki.apache.org/confluence/display/NUTCH/Running+Nutch+on+Tez
At that time I ran into issues with the code implementation because I was trying to have as little impact on the Nutch codebase as possible. That is to say, I was trying to avoid an entire re-write of all existing (~18) Nutch MapReduce jobs. 
That being said, once I finish up my current work on documenting [Nutch metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics], I will come back to this issue. 
One of the things I would _*like to do*_ is actually provide documentation for the Tez community to see how we went about migrating from MR to Tez... so I will continue to document things as they go along.
As of writing I don't want to take on too much work. I will come back to this task once my current work is finished.
As a wild request, I wonder if you would be interested in looking at the [Nutch Injector|https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Injector.java] (basically NUTCH-2839), which is the first MR job in a Nutch crawl cycle. If you were able to provide some wisdom as to how you would evolve that MR job --> Tez, it would be great to observe your engineering methodology. No problem if this is not possible; I thought I would ask as a stretch :)

> Apache Tez integration
> ----------------------
>
>                 Key: NUTCH-2838
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2838
>             Project: Nutch
>          Issue Type: New Feature
>          Components: deployment, runtime, tez
>    Affects Versions: 1.18
>            Reporter: Lewis John McGibbney
>            Priority: Major
>             Fix For: 1.19
>
>
> This is a parent epic under which all Tez integration tasks can be nested. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)