Posted to dev@nutch.apache.org by lewis john mcgibbney <le...@apache.org> on 2020/12/10 07:46:30 UTC

[DISCUSS] Replacing MapReduce with Tez

Hi dev@,
A while ago I had thought about bringing this topic up... I then got
busy... for ages. I'll therefore get straight to the point.
Has anyone on the dev@ team had any experience using Apache Tez -
tez.apache.org?
Tez promises multiple improvements over MapReduce. Naturally I wondered
whether the Nutch project is now mature enough that we would look
to leverage something more performant than legacy MapReduce.
Were we to consider evolving Nutch by re-architecting it to use Tez as the
processing engine, this would be a significant work effort.
I just wanted to throw this out there for some blue-sky feedback.
Thanks
lewismc

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc

Re: [DISCUSS] Replacing MapReduce with Tez

Posted by Lewis John McGibbney <le...@apache.org>.
Hi dev@,
Short update here. I've documented my initial observations running Nutch on Tez at https://s.apache.org/viee3
Specific early findings are as follows:
1. Counters don't appear to work... which makes sense as all existing counters are manifested using the MapReduce framework. I'm not sure if Tez has a similar/equivalent concept of counters but I am working to find out more.
2. So far, running some basic experiments using the Injector job on roughly 12k URLs, I've observed the following:
- When 'mapreduce.framework.name' is set to 'yarn-tez' I am observing the following runtimes
  * 1st run: elapsed: 00:00:42
  * 2nd run: elapsed: 00:00:13
  * 3rd run: elapsed: 00:00:14

- When 'mapreduce.framework.name' is set to 'yarn' I am observing the following runtimes
  * 1st run: elapsed: 00:00:34
  * 2nd run: elapsed: 00:00:32
  * 3rd run: elapsed: 00:00:34

So after the first run, it looks like running the Injector job on Tez results in a dramatic runtime improvement.
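
As a rough check of the numbers above, the following minimal Python sketch computes the ratio of mean runtimes. The elapsed times are copied from the two lists; the helper names (`mean`, `speedup`) are only illustrative, and treating the 2nd and 3rd runs as "warm" runs is an assumption about why the first Tez run was slower.

```python
# Elapsed times (seconds) copied from the runs reported above.
tez_runs = [42, 13, 14]    # mapreduce.framework.name = yarn-tez
yarn_runs = [34, 32, 34]   # mapreduce.framework.name = yarn


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def speedup(baseline, candidate):
    """Ratio of mean baseline runtime to mean candidate runtime."""
    return mean(baseline) / mean(candidate)


# Compare all three runs, then only the 2nd and 3rd runs
# (assumption: the first Tez run includes one-off startup cost).
all_runs = speedup(yarn_runs, tez_runs)
warm_runs = speedup(yarn_runs[1:], tez_runs[1:])

print(f"all runs:  {all_runs:.2f}x")
print(f"warm runs: {warm_runs:.2f}x")
```

With only three samples per setting on a single job, these ratios are indicative at best and say nothing yet about the other Nutch jobs.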

As I mentioned in the Tez thread, I'm going to document all of this on the Nutch wiki. I also plan to continue my evaluation over the holidays and will report back here when I have more information.

Thanks

Re: [DISCUSS] Replacing MapReduce with Tez

Posted by Lewis John McGibbney <le...@apache.org>.
Hi dev@,
I've documented my Tez journey so far at https://cwiki.apache.org/confluence/display/NUTCH/Running+Nutch+on+Tez
Things are getting quite interesting. 
Please share any experiences using Nutch on Tez or improvements to the documentation, especially any experiments you can document.
Thank you

Re: [DISCUSS] Replacing MapReduce with Tez

Posted by BlackIce <bl...@gmail.com>.
Sounds interesting

On Thu, Dec 10, 2020, 10:29 <sh...@gmail.com> wrote:

> Hi Lewis,
>
> I have some background using and developing on Tez for some time. It might
> be a good improvement if we can fit our model to Apache Tez. If dev@ agrees
> and it is feasible, I will be happy to check and work on this.
>
> Thanks,
> Shashanka
>
> Sent from my iPhone

Re: [DISCUSS] Replacing MapReduce with Tez

Posted by sh...@gmail.com.
Hi Lewis,

I have some background using and developing on Tez for some time. It might be a good improvement if we can fit our model to Apache Tez. If dev@ agrees and it is feasible, I will be happy to check and work on this.

Thanks,
Shashanka

Sent from my iPhone
