Posted to user@tez.apache.org by Lewis John McGibbney <le...@apache.org> on 2020/12/11 02:40:33 UTC

Porting legacy MapReduce application to Tez

Hi user@,

First a couple of things
1. Thanks to Jonathan Eagles (who I spoke to off-list) for explaining a bit about the Tez community. When I looked through the mailing lists, even though you guys just made a release, I wasn't sure if the project was alive and kicking. Thanks Jonathan for confirming.
2. Based on my digging through documentation and YouTube videos, I pulled together TEZ-4257 [0] and the corresponding pull request [1]. I also saw that the TravisCI build was broken so I produced TEZ-4258.

Now, the important stuff... I'm a long-time developer of the Apache Nutch project [2]; a well-matured, production-ready Web crawler. Nutch builds on Apache Hadoop data structures and relies heavily on MapReduce.

A typical Nutch crawl lifecycle involves the following steps
* inject - from a seed list either create or inject entries into an existing crawl database
* generate - generate fetch lists from suitable entries present within the crawl database
* fetch - fetch the content for the generated URL partitions
* parse - extract data and metadata from the fetched content
* updatedb - based upon what was fetched, update the crawl database 
* wash, rinse, repeat (there are other steps, like indexing, however for simplicity let's leave those steps out for now)

I recently started a thread over on the Nutch dev@ list to see if there is any interest in investigating what it would take to evolve Nutch from MapReduce --> Tez. In order to understand the programming model I looked at the Tez Javadoc and examples both of which have been useful.

I suppose I have one basic question. Given my brief explanation of the crawl cycle above, should I be looking to implement just one DAG covering the entire crawl cycle? Or something else?

Currently we automate the crawl cycle via a bash script with each step executed in sequence. There are several appealing reasons why an explicit data flow programming model would be advantageous, but I just need clarity on the correct approach.

Thank you for any assistance.
lewismc

[0] https://issues.apache.org/jira/projects/TEZ/issues/TEZ-4257
[1] https://github.com/apache/tez/pull/82
[2] http://nutch.apache.org

Re: Porting legacy MapReduce application to Tez

Posted by Lewis John McGibbney <le...@apache.org>.
Hi Zhiyuan,
Thanks for your response.

On 2020/12/11 03:51:17, Zhiyuan Yang <zh...@apache.org> wrote: 
> I think the first step can be simply trying replacing what you currently
> have in MapReduce with Tez,

I'm working on understanding how to do that :)
I'm going to start with the Injector tool.
The InjectorMapper [0] reads (i) the crawl database that seeds are injected into, and (ii) a plain-text seed file, parsing each line in a particular way. Depending on configuration and command-line parameters, the URLs are normalized and filtered using the configured plugins.
The InjectorReducer [1] combines multiple new entries for a URL based on some logical rules.
The result is the updated crawl database, serialized via MapFileOutputFormat, where output keys are of type org.apache.hadoop.io.Text and values are of a custom CrawlDatum type [2] which represents the crawl state of a URL.
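To make the shape of those two phases concrete, here is a tiny self-contained sketch. The seed-line format (URL followed by optional tab-separated name=value pairs) and the "keep the existing entry" merge rule are simplified assumptions for illustration; the actual logic lives in Injector.java, linked below.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified sketch of the Injector's two phases (illustrative only;
// the real parsing and merge rules live in Nutch's Injector.java).
public class InjectorSketch {

    // Map side: a seed line is a URL, optionally followed by tab-separated
    // name=value metadata pairs (format assumed for this sketch).
    static Map.Entry<String, Map<String, String>> parseSeedLine(String line) {
        String[] parts = line.trim().split("\t");
        Map<String, String> metadata = new LinkedHashMap<>();
        Arrays.stream(parts).skip(1)
            .map(p -> p.split("=", 2))
            .filter(kv -> kv.length == 2)
            .forEach(kv -> metadata.put(kv[0], kv[1]));
        return Map.entry(parts[0], metadata);
    }

    // Reduce side: when the crawl database already holds an entry for a
    // URL, prefer it over a freshly injected duplicate (one plausible
    // merge rule; Nutch's actual behavior is configurable).
    static String merge(String existingDatum, String injectedDatum) {
        return existingDatum != null ? existingDatum : injectedDatum;
    }
}
```

In the real job the value type is CrawlDatum rather than String; the sketch only shows where the per-URL parsing and the per-key merge happen.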

Do you have any advice on how one would go about transforming the above job into a Tez DAG?

[0] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Injector.java#L105-L269
[1] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Injector.java#L271-L349
[2] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDatum.java
  
> instead of trying to integrate the entire data
> flow into Tez in a single step. My concern is whether there are some
> unpopular MapReduce features you rely on but are not supported by Tez yet.
> 

I would not be surprised. I will only really know this once I encounter something (which I hope does not happen).

Thanks again for any thoughts you have.
lewismc

Re: Porting legacy MapReduce application to Tez

Posted by Lewis John McGibbney <le...@apache.org>.
Some more observations

4. When 'mapreduce.framework.name' is set to 'yarn', counters are present for the Injector job. The following is example output showing all of the Injector job counters.

2020-12-21 12:13:58,674 INFO mapreduce.Job: Job job_1608581566698_0001 completed successfully
2020-12-21 12:13:58,760 INFO mapreduce.Job: Counters: 52
	File System Counters
		FILE: Number of bytes read=1456826
		FILE: Number of bytes written=3699396
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=1163333
		HDFS: Number of bytes written=794148
		HDFS: Number of read operations=15
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=6
	Job Counters
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=19002
		Total time spent by all reduces in occupied slots (ms)=7305
		Total time spent by all map tasks (ms)=6334
		Total time spent by all reduce tasks (ms)=2435
		Total vcore-milliseconds taken by all map tasks=6334
		Total vcore-milliseconds taken by all reduce tasks=2435
		Total megabyte-milliseconds taken by all map tasks=9501000
		Total megabyte-milliseconds taken by all reduce tasks=3652500
	Map-Reduce Framework
		Map input records=22845
		Map output records=22765
		Map output bytes=1411290
		Map output materialized bytes=1456832
		Input split bytes=572
		Combine input records=0
		Combine output records=0
		Reduce input groups=11322
		Reduce shuffle bytes=1456832
		Reduce input records=22765
		Reduce output records=11322
		Spilled Records=45530
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=114
		CPU time spent (ms)=0
		Physical memory (bytes) snapshot=0
		Virtual memory (bytes) snapshot=0
		Total committed heap usage (bytes)=1046478848
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	injector
		urls_filtered=80
		urls_injected=11443
		urls_merged=11322
	File Input Format Counters
		Bytes Read=0
	File Output Format Counters
		Bytes Written=794148
2020-12-21 12:13:58,793 INFO crawl.Injector: Injector: Total urls rejected by filters: 80
2020-12-21 12:13:58,793 INFO crawl.Injector: Injector: Total urls injected after normalization and filtering: 11443
2020-12-21 12:13:58,793 INFO crawl.Injector: Injector: Total urls injected but already in CrawlDb: 11322
2020-12-21 12:13:58,793 INFO crawl.Injector: Injector: Total new urls injected: 121
2020-12-21 12:13:58,794 INFO crawl.Injector: Injector: Total urls with status gone removed from CrawlDb (db.update.purge.404): 0
2020-12-21 12:13:58,804 INFO crawl.Injector: Injector: finished at 2020-12-21 12:13:58, elapsed: 00:00:36

5. When 'mapreduce.framework.name' is set to 'yarn-tez' I am observing the following runtimes
  * 1st run: elapsed: 00:00:42
  * 2nd run: elapsed: 00:00:13
  * 3rd run: elapsed: 00:00:14

6. When 'mapreduce.framework.name' is set to 'yarn' I am observing the following runtimes
  * 1st run: elapsed: 00:00:34
  * 2nd run: elapsed: 00:00:32
  * 3rd run: elapsed: 00:00:34

After the first DAG run, it looks like there is a marked runtime improvement running the Injector job on Tez. It would be great if I could somehow explain this but my Tez knowledge is still very limited. I will keep digging though!

I'm also going to create a tutorial on the Nutch wiki covering this entire experience. I'll maybe pull together a YouTube video or something as well.

lewismc

On 2020/12/21 20:11:59, Lewis John McGibbney <le...@apache.org> wrote: 
> Hi László,
> Thank you for the additional explanation. Adapting my configuration based on your suggestions results in successful job execution as DAGs now. A huge thank you :)
> 
> A couple of notes
> 1. I was unable to use the Tez minimal distribution. I had to use the full 0.10.0-SNAPSHOT due to the absence of the jetty-http-9.4.20.v20190813.jar dependency in the minimal distribution.
> 2. Looking at the syslog for the DAG, I have observed java.lang.NoSuchMethodException: java.nio.channels.ClosedByInterruptException. The full paste can be seen at https://paste.apache.org/mjw0c. Is this normal behavior? 
> 3. We added some useful counters to the Injector job which are printed to the application log. Example is as follows
> 
> 2020-12-21 12:06:35,242 INFO mapreduce.Job: Job job_1608580287657_0003 completed successfully
> 2020-12-21 12:06:35,249 INFO mapreduce.Job: Counters: 0
> 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls rejected by filters: 0
> 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls injected after normalization and filtering: 0
> 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls injected but already in CrawlDb: 0
> 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total new urls injected: 0
> 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls with status gone removed from CrawlDb (db.update.purge.404): 0
> 2020-12-21 12:06:35,293 INFO crawl.Injector: Injector: finished at 2020-12-21 12:06:35, elapsed: 00:00:14
> 
> As you can see, apparently the counters are not correctly representing the relevant entries in the newly created crawl database. I can verify this by running the following read database job
> 
> nutch readdb crawldb -stats
> 
> 2020-12-21 12:08:06,558 INFO mapreduce.Job: Counters: 0
> 2020-12-21 12:08:06,586 INFO crawl.CrawlDbReader: Statistics for CrawlDb: crawldb
> 2020-12-21 12:08:06,586 INFO crawl.CrawlDbReader: TOTAL urls:	11322
> 2020-12-21 12:08:06,597 INFO crawl.CrawlDbReader: shortest fetch interval:	30 days, 00:00:00
> 2020-12-21 12:08:06,597 INFO crawl.CrawlDbReader: avg fetch interval:	30 days, 00:00:00
> 2020-12-21 12:08:06,598 INFO crawl.CrawlDbReader: longest fetch interval:	30 days, 00:00:00
> 2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: earliest fetch time:	Mon Dec 21 12:06:00 PST 2020
> 2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: avg of fetch times:	Mon Dec 21 12:06:00 PST 2020
> 2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: latest fetch time:	Mon Dec 21 12:06:00 PST 2020
> 2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: retry 0:	11322
> 2020-12-21 12:08:06,605 INFO crawl.CrawlDbReader: score quantile 0.01:	1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.05:	1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.1:	1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.2:	1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.25:	1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.3:	1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.4:	1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.5:	1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.6:	1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.7:	1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.75:	1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.8:	1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.9:	1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.95:	1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.99:	1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: min score:	1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: avg score:	1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: max score:	1.0
> 2020-12-21 12:08:06,610 INFO crawl.CrawlDbReader: status 1 (db_unfetched):	11322
> 2020-12-21 12:08:06,610 INFO crawl.CrawlDbReader: CrawlDb statistics: done
> 
> I'm investigating this now.
> 
> Again, thank you very much for your help.
> 
> On 2020/12/21 00:04:36, László Bodor <bo...@gmail.com> wrote: 
> > Hi!
> > 
> > This is how I made it work (hadoop 3.1.3, tez 0.10.0), attached to drive:
> > here
> > <https://drive.google.com/file/d/1eFMUPSxFpJ0p7fi7IrsI3HAACa4m5s7n/view?usp=sharing>
> > 
> > 1.
> > hdfs dfs -mkdir -p /apps/tez
> > hdfs dfs -put ~/Applications/apache/tez/tez.tar.gz /apps/tez
> > 
> ..
> 

Re: Porting legacy MapReduce application to Tez

Posted by Rohini Palaniswamy <ro...@apache.org>.
The nutch jars have to be added to the distributed cache (
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#DistributedCache)
for them to be available on the classpath in the tasks. Distributed cache is
mapreduce terminology (from hadoop 1.x). With YARN (hadoop 2.x) the
implementation is via LocalResource (
https://blog.cloudera.com/resource-localization-in-yarn-deep-dive/). In the Tez
user APIs you will find LocalResource, while mapreduce still maintains the
original DistributedCache user API with the underlying implementation
being LocalResource.

The hadoop jar command takes care of adding the jar in the command to the
distributed cache. Any additional files need to be shipped with
-files/-libjars/-archives option (
https://hadoop.apache.org/docs/r1.0.4/commands_manual.html#Generic+Options)
or using the settings mapreduce.job.cache.{files|archives}. yarn-tez mode
also honors the mapreduce.job.cache.{files|archives} settings. So instead
of adding it to tez.lib.uris.classpath, you can specify via those settings.
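For the MapReduce-side route, the Job API exposes the same shipping options programmatically. A rough sketch (the file paths are hypothetical examples; addCacheFile, addFileToClassPath and addCacheArchive are the standard org.apache.hadoop.mapreduce.Job methods):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class CacheSetupSketch {
    // Programmatic equivalents of the -files / -libjars / -archives options.
    static Job configureInjectorJob(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "inject");
        // -files: ship a plain file into every task's working directory.
        job.addCacheFile(new URI("hdfs:///apps/nutch/conf/regex-urlfilter.txt"));
        // -libjars: ship a jar and add it to the task classpath.
        job.addFileToClassPath(new Path("/apps/nutch/lib/extra-dependency.jar"));
        // -archives: ship an archive, unpacked under its #link name.
        job.addCacheArchive(new URI("hdfs:///apps/nutch/plugins.tar.gz#plugins"));
        return job;
    }
}
```

This fragment needs the hadoop-mapreduce-client jars on the classpath to compile, so treat it as orientation rather than a drop-in.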

Just a heads up: Tez is slightly more low-level and was meant to be used by
frameworks like Pig, Hive, Cascading, etc., so building a Tez application
DAG from scratch is going to be more code and not as straightforward as
writing a mapper and reducer job. But it does come with a lot of
flexibility and the ability to customize the DAG, and it can make a big
difference for some applications. For example, Twitter folks extended it and
used it for an application to do custom partitioning and routing of data (
https://issues.apache.org/jira/browse/TEZ-3209). Below are some classes
from Pig and Hive where DAGs are constructed, to give you a general idea.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionState.java
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java
https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezSessionManager.java
https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java
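For a first orientation, a two-vertex DAG equivalent to a single map/reduce job like the Injector can be wired roughly as below. This is a sketch modeled on the Tez wordcount example: the two processor class names are hypothetical stand-ins for processors you would write (e.g. by extending SimpleProcessor from tez-runtime-library); the rest uses the stock Tez 0.9/0.10 API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.mapreduce.input.MRInput;
import org.apache.tez.mapreduce.output.MROutput;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfig;

public class InjectorDagSketch {

    static DAG buildInjectorDag(Configuration conf, String seedDir, String crawlDbOut) {
        // Map-side vertex: reads the seed file(s) via an MRInput data source.
        Vertex map = Vertex.create("InjectorMap",
            ProcessorDescriptor.create("org.apache.nutch.tez.InjectorMapProcessor"));
        map.addDataSource("seeds",
            MRInput.createConfigBuilder(conf, TextInputFormat.class, seedDir).build());

        // Reduce-side vertex: writes the merged crawl database via MROutput.
        Vertex reduce = Vertex.create("InjectorReduce",
            ProcessorDescriptor.create("org.apache.nutch.tez.InjectorReduceProcessor"), 1);
        reduce.addDataSink("crawldb",
            MROutput.createConfigBuilder(conf, MapFileOutputFormat.class, crawlDbOut).build());

        // Scatter-gather edge: sorted, partitioned key/value pairs,
        // i.e. the MapReduce shuffle.
        OrderedPartitionedKVEdgeConfig shuffle = OrderedPartitionedKVEdgeConfig
            .newBuilder(Text.class.getName(),
                        "org.apache.nutch.crawl.CrawlDatum",
                        HashPartitioner.class.getName())
            .build();

        return DAG.create("NutchInjector")
            .addVertex(map)
            .addVertex(reduce)
            .addEdge(Edge.create(map, reduce, shuffle.createDefaultEdgeProperty()));
    }
}
```

The DAG would then be submitted through a TezClient (create, start, submitDAG, waitForCompletion). This cannot compile without the tez and hadoop jars on the classpath, so treat it as a shape to follow, not a drop-in.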

Regards,
Rohini

On Mon, Dec 21, 2020 at 12:50 PM Lewis John McGibbney <le...@apache.org>
wrote:

> I found the Tez Counters package
>
> https://tez.apache.org/releases/0.9.2/tez-api-javadocs/index.html?org/apache/tez/common/counters/package-summary.html
> I'm going to experiment adapting the Injector job to use this package
> rather than the legacy Map and Reduce Context objects.
>
> On 2020/12/21 20:11:59, Lewis John McGibbney <le...@apache.org> wrote:
> > Hi László,
> > Thank you for the additional explanation. Adapting my configuration
> based on your suggestions results in successful job execution as DAGs now.
> A huge thank you :)
> >
> ..
>

Re: Porting legacy MapReduce application to Tez

Posted by Lewis John McGibbney <le...@apache.org>.
I found the Tez Counters package
https://tez.apache.org/releases/0.9.2/tez-api-javadocs/index.html?org/apache/tez/common/counters/package-summary.html
I'm going to experiment adapting the Injector job to use this package rather than the legacy Map and Reduce Context objects.

On 2020/12/21 20:11:59, Lewis John McGibbney <le...@apache.org> wrote: 
> Hi László,
> Thank you for the additional explanation. Adapting my configuration based on your suggestions results in successful job execution as DAGs now. A huge thank you :)
> 
..

Re: Porting legacy MapReduce application to Tez

Posted by Lewis John McGibbney <le...@apache.org>.
Hi László,
Thank you for the additional explanation. Adapting my configuration based on your suggestions results in successful job execution as DAGs now. A huge thank you :)

A couple of notes
1. I was unable to use the Tez minimal distribution. I had to use the full 0.10.0-SNAPSHOT due to the absence of the jetty-http-9.4.20.v20190813.jar dependency in the minimal distribution.
2. Looking at the syslog for the DAG, I have observed java.lang.NoSuchMethodException: java.nio.channels.ClosedByInterruptException. The full paste can be seen at https://paste.apache.org/mjw0c. Is this normal behavior? 
3. We added some useful counters to the Injector job which are printed to the application log. Example is as follows

2020-12-21 12:06:35,242 INFO mapreduce.Job: Job job_1608580287657_0003 completed successfully
2020-12-21 12:06:35,249 INFO mapreduce.Job: Counters: 0
2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls rejected by filters: 0
2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls injected after normalization and filtering: 0
2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls injected but already in CrawlDb: 0
2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total new urls injected: 0
2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls with status gone removed from CrawlDb (db.update.purge.404): 0
2020-12-21 12:06:35,293 INFO crawl.Injector: Injector: finished at 2020-12-21 12:06:35, elapsed: 00:00:14

As you can see, the counters apparently do not correctly represent the relevant entries in the newly created crawl database. I can verify this by running the following read database job

nutch readdb crawldb -stats

2020-12-21 12:08:06,558 INFO mapreduce.Job: Counters: 0
2020-12-21 12:08:06,586 INFO crawl.CrawlDbReader: Statistics for CrawlDb: crawldb
2020-12-21 12:08:06,586 INFO crawl.CrawlDbReader: TOTAL urls:	11322
2020-12-21 12:08:06,597 INFO crawl.CrawlDbReader: shortest fetch interval:	30 days, 00:00:00
2020-12-21 12:08:06,597 INFO crawl.CrawlDbReader: avg fetch interval:	30 days, 00:00:00
2020-12-21 12:08:06,598 INFO crawl.CrawlDbReader: longest fetch interval:	30 days, 00:00:00
2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: earliest fetch time:	Mon Dec 21 12:06:00 PST 2020
2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: avg of fetch times:	Mon Dec 21 12:06:00 PST 2020
2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: latest fetch time:	Mon Dec 21 12:06:00 PST 2020
2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: retry 0:	11322
2020-12-21 12:08:06,605 INFO crawl.CrawlDbReader: score quantile 0.01:	1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.05:	1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.1:	1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.2:	1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.25:	1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.3:	1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.4:	1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.5:	1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.6:	1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.7:	1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.75:	1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.8:	1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.9:	1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.95:	1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.99:	1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: min score:	1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: avg score:	1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: max score:	1.0
2020-12-21 12:08:06,610 INFO crawl.CrawlDbReader: status 1 (db_unfetched):	11322
2020-12-21 12:08:06,610 INFO crawl.CrawlDbReader: CrawlDb statistics: done

I'm investigating this now.

Again, thank you very much for your help.

On 2020/12/21 00:04:36, László Bodor <bo...@gmail.com> wrote: 
> Hi!
> 
> This is how I made it work (hadoop 3.1.3, tez 0.10.0), attached to drive:
> here
> <https://drive.google.com/file/d/1eFMUPSxFpJ0p7fi7IrsI3HAACa4m5s7n/view?usp=sharing>
> 
> 1.
> hdfs dfs -mkdir -p /apps/tez
> hdfs dfs -put ~/Applications/apache/tez/tez.tar.gz /apps/tez
> 
..

Re: Porting legacy MapReduce application to Tez

Posted by László Bodor <bo...@gmail.com>.
Hi!

This is how I made it work (hadoop 3.1.3, tez 0.10.0), attached to drive:
here
<https://drive.google.com/file/d/1eFMUPSxFpJ0p7fi7IrsI3HAACa4m5s7n/view?usp=sharing>

1.
hdfs dfs -mkdir -p /apps/tez
hdfs dfs -put ~/Applications/apache/tez/tez.tar.gz /apps/tez

hdfs dfs -mkdir /nutch
hdfs dfs -put nutch.tar.gz /nutch

hdfs dfs -mkdir /user/$USER/
echo "https://www.jpl.nasa.gov/news/" > seed.txt
hdfs dfs -mkdir -p /user/$USER/urls
hdfs dfs -put seed.txt /user/$USER/urls #some examples
hadoop jar apache-nutch-1.18-SNAPSHOT.jar org.apache.nutch.crawl.Injector
crawldb /user/$USER/urls
hadoop jar apache-nutch-1.18-SNAPSHOT.jar org.apache.nutch.crawl.CrawlDb
crawldb

2. Some notes
tez-site.xml that I used is included into the package, but its content is:
<property>
<name>tez.lib.uris</name>
<value>/apps/tez/tez.tar.gz#tez,/nutch/nutch.tar.gz#nutch</value>
</property>
<property>
<name>tez.lib.uris.classpath</name>
<value>./tez/*,./tez/lib/*,./nutch/nutch/*,./nutch/nutch/lib/*,./nutch/nutch/classes/plugins/</value>
</property>
<property>
<name>tez.use.cluster.hadoop-libs</name>
<value>false</value>
</property>
<property>
<name>plugin.folders</name>
<value>nutch/nutch/classes/plugins</value>
</property>
I needed to create a nutch.tar.gz archive; this way Tez was able to
localize it through tez.lib.uris to the containers running on YARN. Jar files
are not decompressed, so, for instance, a lib folder inside a jar won't be on
the classpath (docs
<https://tez.apache.org/releases/0.9.2/tez-api-javadocs/configs/TezConfiguration.html>)
*tez.lib.uris.classpath*: be aware of where the beautiful "/nutch/nutch/*"
comes from (it's just my quick repro): the first "/nutch" is there because
tez.lib.uris localizes nutch.tar.gz into it according to "#nutch", and the
second is there because my nutch.tar.gz has an inner structure of "/nutch", so
files end up being copied to PWD/nutch/nutch/* for the container...
you're free to create a nutch.tar.gz without a root folder inside, and
configure accordingly in order to have prettier paths.

*plugin.folders*: plugins should also be pointed properly to the localized
path

Attached tez app logs for a successful injector run.

Regards,
Laszlo Bodor

On Sun, 20 Dec 2020 at 06:38, Lewis John McGibbney <le...@apache.org>
wrote:

> Hi Jonathan,
>
> Thank you for the response. This is very useful.
>
> Using your configuration I am able to execute the Tez examples no problem.
> The issue is when I attempt to run Nutch. No matter what I've tried, the
> dependencies for Nutch are never found.
> I've tried building a binary .tar.gz distribution of Nutch and referencing
> its URI on HDFS... this does not work and I get ClassNotFound exceptions.
> I've tried referencing the Nutch .job artifact which contains all
> dependencies... this does not work.
>
> Just to confirm, I can successfully execute all Nutch jobs when '
> mapreduce.framework.name' value is set to 'yarn'. We execute the jobs as
> follows
>
> hadoop jar ${NUTCH.job} $CLASS $arguments
>
> I feel like I am very close to getting this running. I wonder if someone
> on this list could make an attempt at running a job and seeing if they can
> reproduce? I've uploaded the compiled .job and the nutch bash script at
> https://drive.google.com/drive/folders/1yjGi8UWVZithcYWLgUINm9v6IU2Scmy5?usp=sharing
>
> You can execute the Injector tool by running
>
> ./nutch inject crawldb urls //assuming that urls is a directory on HDFS
> containing a simple text file with one URL entry i.e.
> http://tez.apache.org
>
> Again, thank you to you all for any further direction. I am really keen to
> get Nutch running on Tez.
>
> lewismc
>
> On 2020/12/17 18:09:02, Jonathan Eagles <je...@gmail.com> wrote:
> > This is what I use in production that has many benefits. In this case
> > mapreduce.application.framework.path is the runtime classpath tar.gz file
> > that is custom built mapreduce runtime environment, perhaps similar to
> nutch
> > 1) localizing one tar.gz file instead of many individual jars
> > 2) minimal jar has fewer class conflicts and a smaller footprint
> > 3) localizing tez to tez folder (#tez) allows better control of the
> > classpath to avoid java inconsistent classpath resolution of jars in same
> > directory
> > 4) use cluster hadoop-libs false avoids using the jars from the individual
> > nodemanagers and only relies on jars listed in tez.lib.uris
> >
> >   <property>
> >     <name>mapreduce.application.framework.path</name>
> >
> >
> <value>/hdfs/path/hadoop-mapreduce-${mapreduce.application.framework.version}.tgz#hadoop-mapreduce</value>
> >   </property>
> >
> >   <property>
> >     <name>tez.lib.uris</name>
> >
> >
> <value>/hdfs/path/tez-0.9.2-minimal.tar.gz#tez,${mapreduce.application.framework.path}</value>
> >   </property>
> >   <property>
> >     <name>tez.lib.uris.classpath</name>
> >     <value>${mapreduce.application.classpath},./tez/*,./tez/lib/*</value>
> >   </property>
> >   <property>
> >     <name>tez.use.cluster.hadoop-libs</name>
> >     <value>false</value>
> >   </property>
> >
> > On Thu, Dec 17, 2020 at 11:57 AM Lewis John McGibbney <
> lewismc@apache.org>
> > wrote:
> >
> > > I tried the following configuration in tez-site.xml with no luck
> > >
> > > <configuration>
> > > <property>
> > >   <name>tez.lib.uris</name>
> > >
> > >
> <value>${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT,${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT/lib,${fs.defaultFS}/apps/nutch/apache-nutch-1.18-SNAPSHOT.job</value>
> > > </property>
> > >
> > > <property>
> > >   <name>tez.lib.uris.classpath</name>
> > >
>  <value>${fs.defaultFS}/apps/nutch/apache-nutch-1.18-SNAPSHOT.job</value>
> > > </property>
> > > </configuration>
> > >
> > > On 2020/12/17 17:35:28, Lewis John McGibbney <le...@apache.org>
> wrote:
> > > > Hi Zhiyuan,
> > > > Thanks for the guidance. I'm making progress but I am still battling
> > > initial configuration management issues.
> > > > I'm running HDFS and YARN v3.1.4 in pseudo-mode.
> > > > My tez-site.xml contains the following content
> > > >
> > > > <configuration>
> > > > <property>
> > > >   <name>tez.lib.uris</name>
> > > >
> > >
> <value>${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT,${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT/lib,${fs.defaultFS}/apps/nutch</value>
> > > > </property>
> > > > </configuration>
> > > >
> > > > N.B. When I attempted to use the compressed Tez tar.gz, I was running
> > > into classpath issues which are largely documented in the installation
> > > documentation you pointed me to. I overcame these issues by simply
> > > uploading the minimal directory. All seems fine at this stage as I can
> run
> > > all of the Tez examples.
> > > >
> > > > I run into trouble when I try to run any job from the Nutch
> application.
> > > For example when I run the Injector one of the Nutch plugin extension
> > > points (x point org.apache.nutch.net.URLNormalizer) cannot be found.
> > > The relevant log can be seen at https://paste.apache.org/4whoe.
> > > > I should note that the entire Nutch .job is available on HDFS at the
> URI
> > > defined in the tez-site.xml above.
> > > >
> > > > The output of jar -tf on the nutch.job artifact can be seen at
> > > https://paste.apache.org/hl8tk.
> > > > Am I required to somehow describe the structural hierarchy of this
> > > artifact in the tez.lib.uris.classpath configuration property?
> > > >
> > > > Thank you again for any guidance.
> > > >
> > > > lewismc
> > > >
> > > > On 2020/12/14 03:23:48, Zhiyuan Yang <zh...@apache.org> wrote:
> > > > > Hi Lewis,
> > > > >
> > > > > If there is no incompatibility, your existing job will run well on
> Tez
> > > > > without code change. You can just follow this guide
> > > > > <https://tez.apache.org/install.html> (especially step 4) to try
> it
> > > out.
> > > > >
> > > > > Thanks,
> > > > > Zhiyuan
> > > > >
> > > > > On Mon, Dec 14, 2020 at 9:04 AM Lewis John McGibbney <
> > > lewismc@apache.org>
> > > > > wrote:
> > > > >
> > > >
> > > >
> > >
> >
>

Re: Porting legacy MapReduce application to Tez

Posted by Lewis John McGibbney <le...@apache.org>.
Hi Jonathan,

Thank you for the response. This is very useful.

Using your configuration I am able to execute the Tez examples no problem. The issue is when I attempt to run Nutch. No matter what I've tried, the dependencies for Nutch are never found.
I've tried building a binary .tar.gz distribution of Nutch and referencing its URI on HDFS... this does not work and I get ClassNotFound exceptions. I've tried referencing the Nutch .job artifact which contains all dependencies... this does not work.

Just to confirm, I can successfully execute all Nutch jobs when 'mapreduce.framework.name' value is set to 'yarn'. We execute the jobs as follows

hadoop jar ${NUTCH.job} $CLASS $arguments

I feel like I am very close to getting this running. I wonder if someone on this list could make an attempt at running a job and seeing if they can reproduce? I've uploaded the compiled .job and the nutch bash script at https://drive.google.com/drive/folders/1yjGi8UWVZithcYWLgUINm9v6IU2Scmy5?usp=sharing

You can execute the Injector tool by running 

./nutch inject crawldb urls //assuming that urls is a directory on HDFS containing a simple text file with one URL entry, e.g. http://tez.apache.org

Again, thank you to you all for any further direction. I am really keen to get Nutch running on Tez.

lewismc

On 2020/12/17 18:09:02, Jonathan Eagles <je...@gmail.com> wrote: 
> This is what I use in production and it has many benefits. In this case
> mapreduce.application.framework.path is the runtime classpath tar.gz file,
> that is, a custom-built mapreduce runtime environment, perhaps similar to nutch
> 1) localizing one tar.gz file instead of many individual jars
> 2) a minimal jar has fewer class conflicts and a smaller footprint
> 3) localizing tez to a tez folder (#tez) allows better control of the
> classpath and avoids java's inconsistent classpath resolution of jars in the
> same directory
> 4) use cluster hadoop-libs false avoids using the jars from the individual
> nodemanagers and only relies on jars listed in tez.lib.uris
> 
>   <property>
>     <name>mapreduce.application.framework.path</name>
> 
> <value>/hdfs/path/hadoop-mapreduce-${mapreduce.application.framework.version}.tgz#hadoop-mapreduce</value>
>   </property>
> 
>   <property>
>     <name>tez.lib.uris</name>
> 
> <value>/hdfs/path/tez-0.9.2-minimal.tar.gz#tez,${mapreduce.application.framework.path}</value>
>   </property>
>   <property>
>     <name>tez.lib.uris.classpath</name>
>     <value>${mapreduce.application.classpath},./tez/*,./tez/lib/*</value>
>   </property>
>   <property>
>     <name>tez.use.cluster.hadoop-libs</name>
>     <value>false</value>
>   </property>
> 
> On Thu, Dec 17, 2020 at 11:57 AM Lewis John McGibbney <le...@apache.org>
> wrote:
> 
> > I tried the following configuration in tez-site.xml with no luck
> >
> > <configuration>
> > <property>
> >   <name>tez.lib.uris</name>
> >
> > <value>${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT,${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT/lib,${fs.defaultFS}/apps/nutch/apache-nutch-1.18-SNAPSHOT.job</value>
> > </property>
> >
> > <property>
> >   <name>tez.lib.uris.classpath</name>
> >   <value>${fs.defaultFS}/apps/nutch/apache-nutch-1.18-SNAPSHOT.job</value>
> > </property>
> > </configuration>
> >
> > On 2020/12/17 17:35:28, Lewis John McGibbney <le...@apache.org> wrote:
> > > Hi Zhiyuan,
> > > Thanks for the guidance. I'm making progress but I am still battling
> > initial configuration management issues.
> > > I'm running HDFS and YARN v3.1.4 in pseudo-mode.
> > > My tez-site.xml contains the following content
> > >
> > > <configuration>
> > > <property>
> > >   <name>tez.lib.uris</name>
> > >
> >  <value>${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT,${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT/lib,${fs.defaultFS}/apps/nutch</value>
> > > </property>
> > > </configuration>
> > >
> > > N.B. When I attempted to use the compressed Tez tar.gz, I was running
> > into classpath issues which are largely documented in the installation
> > documentation you pointed me to. I overcame these issues by simply
> > uploading the minimal directory. All seems fine at this stage as I can run
> > all of the Tez examples.
> > >
> > > I run into trouble when I try to run any job from the Nutch application.
> > For example when I run the Injector one of the Nutch plugin extension
> > points (x point org.apache.nutch.net.URLNormalizer) cannot be not found.
> > The relevant log can be seen at https://paste.apache.org/4whoe.
> > > I should note that the entire Nutch .job is available on HDFS at the URI
> > defined in the tez-site.xml above.
> > >
> > > The output of jar -tf on the nutch.job artifact can be seen at
> > https://paste.apache.org/hl8tk.
> > > Am I required to somehow describe the structural heirarchy of this
> > artifact in the tez.lib.uris.classpath configuration property?
> > >
> > > Thank you again for any guidance.
> > >
> > > lewismc
> > >
> > > On 2020/12/14 03:23:48, Zhiyuan Yang <zh...@apache.org> wrote:
> > > > Hi Lewis,
> > > >
> > > > If there is no incompatibility, your existing job will run well on Tez
> > > > without code change. You can just follow this guide
> > > > <https://tez.apache.org/install.html> (especially step 4) to try it
> > out.
> > > >
> > > > Thanks,
> > > > Zhiyuan
> > > >
> > > > On Mon, Dec 14, 2020 at 9:04 AM Lewis John McGibbney <
> > lewismc@apache.org>
> > > > wrote:
> > > >
> > >
> > >
> >
> 

Re: Porting legacy MapReduce application to Tez

Posted by Jonathan Eagles <je...@gmail.com>.
This is what I use in production, and it has several benefits. In this case
mapreduce.application.framework.path is a runtime classpath tar.gz file
containing a custom-built MapReduce runtime environment, perhaps similar to
the Nutch .job.
1) localizing one tar.gz file instead of many individual jars
2) a minimal jar has fewer class conflicts and a smaller footprint
3) localizing Tez to a tez folder (#tez) allows better control of the
classpath, avoiding Java's inconsistent resolution of jars in the same
directory
4) setting tez.use.cluster.hadoop-libs to false avoids using the jars from
the individual NodeManagers and relies only on the jars listed in
tez.lib.uris

  <property>
    <name>mapreduce.application.framework.path</name>

<value>/hdfs/path/hadoop-mapreduce-${mapreduce.application.framework.version}.tgz#hadoop-mapreduce</value>
  </property>

  <property>
    <name>tez.lib.uris</name>

<value>/hdfs/path/tez-0.9.2-minimal.tar.gz#tez,${mapreduce.application.framework.path}</value>
  </property>
  <property>
    <name>tez.lib.uris.classpath</name>
    <value>${mapreduce.application.classpath},./tez/*,./tez/lib/*</value>
  </property>
  <property>
    <name>tez.use.cluster.hadoop-libs</name>
    <value>false</value>
  </property>
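One piece not shown above is the switch that routes legacy MapReduce jobs onto the Tez runtime in the first place. Per the install guide referenced earlier in this thread, that is a single property in mapred-site.xml (shown here as a sketch; set it back to "yarn" to fall back to plain MapReduce):

```xml
  <!-- mapred-site.xml: submit MapReduce jobs through the Tez runtime.
       The archives and classpath above must already be configured in
       tez-site.xml for this to work. -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn-tez</value>
  </property>
```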


Re: Porting legacy MapReduce application to Tez

Posted by Lewis John McGibbney <le...@apache.org>.
I tried the following configuration in tez-site.xml with no luck

<configuration>
<property>
  <name>tez.lib.uris</name>
  <value>${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT,${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT/lib,${fs.defaultFS}/apps/nutch/apache-nutch-1.18-SNAPSHOT.job</value>
</property>

<property>
  <name>tez.lib.uris.classpath</name>
  <value>${fs.defaultFS}/apps/nutch/apache-nutch-1.18-SNAPSHOT.job</value>
</property>
</configuration>


Re: Porting legacy MapReduce application to Tez

Posted by Lewis John McGibbney <le...@apache.org>.
Hi Zhiyuan,
Thanks for the guidance. I'm making progress but I am still battling initial configuration management issues. 
I'm running HDFS and YARN v3.1.4 in pseudo-mode.
My tez-site.xml contains the following content

<configuration>
<property>
  <name>tez.lib.uris</name>
  <value>${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT,${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT/lib,${fs.defaultFS}/apps/nutch</value>
</property>
</configuration>

N.B. When I attempted to use the compressed Tez tar.gz, I was running into classpath issues which are largely documented in the installation documentation you pointed me to. I overcame these issues by simply uploading the minimal directory. All seems fine at this stage as I can run all of the Tez examples. 

I run into trouble when I try to run any job from the Nutch application. For example, when I run the Injector, one of the Nutch plugin extension points (x point org.apache.nutch.net.URLNormalizer) cannot be found. The relevant log can be seen at https://paste.apache.org/4whoe.
I should note that the entire Nutch .job is available on HDFS at the URI defined in the tez-site.xml above.

The output of jar -tf on the nutch.job artifact can be seen at https://paste.apache.org/hl8tk.
Am I required to somehow describe the structural hierarchy of this artifact in the tez.lib.uris.classpath configuration property?
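For reference, this is the kind of thing I have in mind, modeled on the #tez fragment pattern used elsewhere in this thread. It is an untested sketch: the #nutch fragment name and the internal directory names are guesses, not verified against the .job layout:

```xml
<!-- Hypothetical sketch only: localize the .job as an archive under ./nutch
     via a #fragment, then list its internal directories explicitly on the
     classpath. Directory names below are assumptions. -->
<property>
  <name>tez.lib.uris</name>
  <value>${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT,${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT/lib,${fs.defaultFS}/apps/nutch/apache-nutch-1.18-SNAPSHOT.job#nutch</value>
</property>
<property>
  <name>tez.lib.uris.classpath</name>
  <value>./nutch/*,./nutch/classes,./nutch/lib/*</value>
</property>
```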

Thank you again for any guidance.

lewismc



Re: Porting legacy MapReduce application to Tez

Posted by Zhiyuan Yang <zh...@apache.org>.
Hi Lewis,

If there is no incompatibility, your existing job will run on Tez
without code changes. You can just follow this guide
<https://tez.apache.org/install.html> (especially step 4) to try it out.

Thanks,
Zhiyuan


Re: Porting legacy MapReduce application to Tez

Posted by Lewis John McGibbney <le...@apache.org>.
Hi László,
Thanks for your response

On 2020/12/12 09:43:33, László Bodor <bo...@gmail.com> wrote: 
> Hi Lewis!
> 
> Just for curiosity's sake, could you please point me to a place in nutch
> code where some of the steps of the workflow are compiled into / done by
> MapReduce?

Please see my response to Zhiyuan earlier in this thread. I have broken down the Injector job and tried to describe the MapReduce logic without going into too many specifics. It would be greatly appreciated if you were able to take a look at that. Also, do you have any general guidance on how one would go about porting a MapReduce job to the Tez programming model? It's not clear to me how one identifies candidate Vertices and Edges. Thank you

> Also - again for curiosity's sake - what about the adoption level of Apache
> Nutch? Could you please send references about Nutch adopters? This looks like
> an interesting project.

Nutch is probably the most popular open source crawler. I understand that Doug Cutting and others began writing it and realized that, in order to scale the Web crawler, they needed a distributed computing model. The Hadoop project was born out of Nutch, so that gives you an idea of how long it has been around. I've been on the project for many years and have interacted with literally thousands of people on the mailing lists. I suspect that it is deployed in a lot of places. I will also say that it is not a particularly easy code base to understand... it is quite complex. Even though Nutch has sensible default configuration, it is unfortunately notoriously difficult to configure, as it has, similar to Hadoop, literally hundreds of configuration parameters which may need to be tuned.

Thank you for assisting me with better understanding the process of evolving MapReduce jobs --> Tez.
lewismc 
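To make the Vertices/Edges question concrete, here is an untested sketch of what one step of the crawl cycle (e.g. inject) might look like as a two-vertex Tez DAG, modeled on the WordCount example shipped with Tez. The processor class names, key/value types and partitioner are placeholders I invented for illustration, not actual Nutch classes, and this requires tez-api and tez-runtime-library on the classpath:

```java
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.tez.client.TezClient;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.TezConfiguration;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfig;

public class InjectDagSketch {
  public static void main(String[] args) throws Exception {
    TezConfiguration tezConf = new TezConfiguration();

    // Each logical step becomes a Vertex wrapping a Processor
    // (the former map or reduce logic). Class names are hypothetical.
    Vertex readSeeds = Vertex.create("ReadSeeds",
        ProcessorDescriptor.create("org.apache.nutch.crawl.InjectMapProcessor"), 4);
    Vertex mergeCrawlDb = Vertex.create("MergeCrawlDb",
        ProcessorDescriptor.create("org.apache.nutch.crawl.InjectReduceProcessor"), 2);

    // The implicit MapReduce shuffle becomes an explicit ordered,
    // partitioned key-value edge between the two vertices.
    OrderedPartitionedKVEdgeConfig edgeConf = OrderedPartitionedKVEdgeConfig
        .newBuilder(Text.class.getName(), NullWritable.class.getName(),
            HashPartitioner.class.getName())
        .build();

    DAG dag = DAG.create("inject")
        .addVertex(readSeeds)
        .addVertex(mergeCrawlDb)
        .addEdge(Edge.create(readSeeds, mergeCrawlDb,
            edgeConf.createDefaultEdgeProperty()));

    TezClient tezClient = TezClient.create("NutchInject", tezConf);
    tezClient.start();
    try {
      tezClient.waitTillReady();
      tezClient.submitDAG(dag).waitForCompletion();
    } finally {
      tezClient.stop();
    }
  }
}
```

The broader cycle (inject, generate, fetch, parse, updatedb) could then either stay as separate DAGs driven by the existing script, or grow into a larger DAG once the per-step ports are proven.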

Re: Porting legacy MapReduce application to Tez

Posted by László Bodor <bo...@gmail.com>.
Hi Lewis!

Just for curiosity's sake, could you please point me to a place in the Nutch
code where some of the steps of the workflow are compiled into / done by
MapReduce?
Also - again for curiosity's sake - what about the adoption level of Apache
Nutch? Could you please send references about Nutch adopters? This looks like
an interesting project.

Thanks,
Laszlo Bodor


Re: Porting legacy MapReduce application to Tez

Posted by Zhiyuan Yang <zh...@apache.org>.
I think the first step can simply be to replace what you currently have in
MapReduce with Tez, instead of trying to integrate the entire data flow into
Tez in a single step. My concern is whether there are some unpopular
MapReduce features you rely on that are not yet supported by Tez.

Thanks,
Zhiyuan
