= Goal: Move Nutch to Hadoop 2.x from the existing Hadoop 1.x codebase =
This page is a proposal for GSoC 2015, related to issue [[https://issues.apache.org/jira/browse/NUTCH-1936|NUTCH-1936]].

= Introduction =
 . '''1) About Nutch:'''

 . [[http://nutch.apache.org/|Apache Nutch]] is a flexible and powerful open source tool for web crawling. It builds on [[http://hadoop.apache.org/|Apache Hadoop]] and integrates with the highly popular [[http://lucene.apache.org/solr/|Apache Solr]] for indexing. Whole-web crawling is designed to handle very large crawls, which may take weeks to complete and run on multiple machines. It also permits more control over the crawl process and supports incremental crawling. It is important to note that whole-web crawling does not necessarily mean crawling the entire World Wide Web: a whole-web crawl can be limited to just the list of URLs we want to crawl.
 .
 . '''2) Basic Nutch Features: '''
  * Runs on top of '''Hadoop'''
  * Scalable: billions of pages possible
  * Some overhead (if scale is not a requirement)
  * Not ideal for low latency

 . '''3) Customizable / extensible plug-in architecture''' (see the filter sketch after this list):
  * Pluggable protocols (document access)
  * URL filters + normalizers
  * Parsing: document formats + metadata extraction
  * Indexing back-ends
  * Mostly used to feed a search index
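
 . To make the plug-in architecture concrete, here is a minimal sketch of a custom URL filter. It assumes the Nutch 1.x '''org.apache.nutch.net.URLFilter''' extension point; the class name !NoQueryStringFilter is hypothetical, and the exact interface may differ between Nutch versions.

{{{#!java
// Hypothetical URL filter plugin sketch (not part of the Nutch tree).
// Assumption: the Nutch 1.x URLFilter extension point, which extends
// org.apache.hadoop.conf.Configurable.
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

public class NoQueryStringFilter implements URLFilter {
  private Configuration conf;

  // Return the URL unchanged to accept it, or null to reject it.
  @Override
  public String filter(String urlString) {
    return urlString.contains("?") ? null : urlString;
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}
}}}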
 . '''4) Nutch Workflow''' <<BR>> [[https://sites.google.com/site/nutch1936/home/introduction/Nutch_Overview.png|Nutch workflow overview diagram]]
 . '''5) Nutch Workflow execution with Hadoop'''
  * Every step is implemented as one (or more) '''MapReduce''' job.
  * Inject, generate, fetch, parse, updatedb, invertlinks, index.
  * Local mode
   * works out-of-the-box (binary package)
   * useful for testing and debugging
  * (Pseudo-)distributed mode
   * parallelization; monitor crawls with the MapReduce web UI
   * recompile and deploy the job file after configuration changes
  * In basic terms, this is map-reduce indexing (see the sketch after this list):
   * Map() just assembles all parts of the documents.
   * Reduce() performs text analysis + indexing.
   * Sends assembled documents to Solr, or adds them to a local Lucene index.
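
 . Below is a minimal, hedged sketch of the map/reduce indexing pattern just described, written against the '''org.apache.hadoop.mapreduce''' API. It illustrates the idea only; it is not Nutch's actual indexer code, and the class names and the simple text-in/text-out formats are assumptions.

{{{#!java
// Sketch: Map() assembles the parts of a document under its URL key;
// Reduce() turns the assembled document into an indexable form.
// Hypothetical class, not taken from the Nutch codebase.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IndexSketch {

  // Map(): pass each document part through, keyed by URL, so that
  // all parts of one document meet in the same reduce call.
  public static class AssembleMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text url, Text part, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(url, part);
    }
  }

  // Reduce(): concatenate the parts and "index" the result. Nutch would
  // send the document to a back-end such as Solr here; this sketch writes it out.
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text url, Iterable<Text> parts, Context ctx)
        throws IOException, InterruptedException {
      StringBuilder doc = new StringBuilder();
      for (Text part : parts) {
        doc.append(part.toString()).append(' ');
      }
      ctx.write(url, new Text(doc.toString().trim()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "index-sketch");
    job.setJarByClass(IndexSketch.class);
    job.setInputFormatClass(KeyValueTextInputFormat.class); // url<TAB>part lines
    job.setMapperClass(AssembleMapper.class);
    job.setReducerClass(IndexReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
}}}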

  . Nutch runs in two modes, namely '''local''' and '''deploy'''. When run in local mode, i.e. running Nutch in a single process on one machine, we use Hadoop only as a library dependency. This may suit us if we have a small site to crawl and index. Nutch is mostly used, however, for its capability to run in deploy mode within a Hadoop cluster. This gives the benefit of a distributed file system (HDFS) and the MapReduce processing style.


 . '''6) Why Hadoop 2.x over 1.x?'''
  . The major difference between Hadoop 1.x and 2.x is the computation platform they use.
 [[https://sites.google.com/site/nutch1936/home/introduction/yarn.png|Hadoop 2.x vs. 1.x architecture diagram]]

 . Hadoop 1.x uses MRv1, whereas 2.x uses MRv2 (a.k.a. YARN).
  * MRv1: Master -> !JobTracker, Slave -> !TaskTracker
  * MRv2: Master -> Resource Manager, Slave -> Node Manager, plus an application-specific Application Master

 . '''Problems with Hadoop 1.x:'''
  * No horizontal scaling
  * Single point of failure: the !NameNode
  * Impossible to run non-MapReduce tools because of the tight coupling of the !JobTracker and MapReduce
  * No multi-tenancy support
  * The !JobTracker is overburdened with too much work
 . The functionality of the !JobTracker in 1.x is split into two components: the application-specific '''Application Master''' and the global '''Resource Manager'''. MRv2 introduces the concept of a 'container': a bundle of resources such as 'x amount of memory, y number of cores'. Allocating containers for tasks is done by the Resource Manager, and the tasks are actually launched by the Application Master in the allocated containers. As a result:
  * There are no more dedicated map/reduce slots on each !TaskTracker; instead, a 'container' is allocated for each task in an application.
  * Hadoop 2.x scales better.
  * NNHA: !NameNode High Availability, avoiding the single point of failure.
  * Hadoop 2.x is more general than Hadoop 1.x.
  * MRv2 supports applications other than MapReduce as well; for example, you can do real-time processing using '''Apache Storm'''.
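
 . To make the 'container' notion above concrete, here is a minimal, hedged sketch of how an Application Master asks the Resource Manager for containers via the YARN '''AMRMClient''' API. The class is hypothetical, the snippet is only valid when run inside a launched Application Master, and the registration/allocation handling is omitted.

{{{#!java
// Hypothetical sketch of a YARN container request (not Nutch code).
// A container is a resource bundle: here, 1024 MB of memory and 2 vcores.
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ContainerRequestSketch {
  public static void main(String[] args) throws Exception {
    // 'x amount of memory, y number of cores'
    Resource capability = Resource.newInstance(1024, 2);
    Priority priority = Priority.newInstance(0);

    // The ResourceManager allocates containers matching such requests;
    // the ApplicationMaster then launches its tasks inside them.
    AMRMClient<ContainerRequest> amrm = AMRMClient.createAMRMClient();
    amrm.init(new YarnConfiguration());
    amrm.start(); // only works inside a running ApplicationMaster
    amrm.addContainerRequest(new ContainerRequest(capability, null, null, priority));
    // registerApplicationMaster()/allocate() loop omitted in this sketch.
    amrm.stop();
  }
}
}}}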
 '''7) References: '''
  . [1] https://wiki.apache.org/nutch/NutchTutorial
  . [2] https://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
  . [3] https://wiki.apache.org/nutch/NutchHadoopTutorial
  . [4] http://www.slideshare.net/wattsteve/web-crawling-and-data-gathering-with-apache-nutch?related=2
  . [5] http://adrianmejia.com/blog/2012/02/04/get-started-with-the-web-crawler-apache-nutch-1-x/
  . [6] http://www.quora.com/What-are-the-differences-between-Hadoop-0-X-1-X-2-X
  . [7] http://stackoverflow.com/questions/19915569/what-are-the-differences-between-hadoop-versions
  . [8] http://www.dataenthusiast.com/2014/09/hadoop-2-0-yarn-architecture/

= Methodology =
== Phase 1(Learning & Experimenting): ==
 . '''1.1) Explore Nutch Documentation:'''
 . Since I have limited knowledge of the Nutch codebase, I will first work through the Nutch documentation '''[1]'''.
 .
 . '''1.2) Workspace Setup:'''
 . The Nutch workspace is built with Ant+Ivy. I have experience with the Ant build framework, so the workspace setup should be relatively easy. I have forked the Nutch codebase to my Git repository '''[2]''', and after successful completion I will provide a patch. Meanwhile, I will also try to resolve issues reported in the Nutch JIRA.
 .
 . '''1.3) Experimental setup of Nutch with Hadoop and the results:'''
 . I have been using Hadoop 2.3 for my MapReduce applications, and while trying to set up Nutch 1.9 with Hadoop 2.3 I ran into the following error:

 . Injector:
{{{
java.lang.UnsupportedOperationException: Not implemented by the DistributedFileSystem FileSystem implementation
        at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:214)
        at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2365)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:297)
        at org.apache.nutch.crawl.Injector.run(Injector.java:380)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.crawl.Injector.main(Injector.java:370)
}}}

 . Maybe this is the point from which I will start investigating.
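
 . A plausible cause, to be verified during the project: this exception typically appears when Hadoop 1.x and 2.x jars are mixed on the classpath. '''FileSystem.getScheme()''' only exists in Hadoop 2.x, and its base implementation throws exactly this '''UnsupportedOperationException''' when the loaded '''DistributedFileSystem''' class comes from a jar that predates it. A hedged sketch of the kind of dependency alignment in '''ivy/ivy.xml''' that might be needed (module names and the 2.3.0 revision are assumptions, not the actual Nutch ivy.xml):

{{{#!xml
<!-- Sketch: align every Hadoop artifact to the same 2.x release and
     drop the old Hadoop 1.x hadoop-core dependency. -->
<dependency org="org.apache.hadoop" name="hadoop-common" rev="2.3.0" conf="*->default"/>
<dependency org="org.apache.hadoop" name="hadoop-hdfs" rev="2.3.0" conf="*->default"/>
<dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core" rev="2.3.0" conf="*->default"/>
}}}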

== Phase 2 (Coding): ==
 . '''2.1) Migrating from Hadoop 1.x to Hadoop 2.x'''
  . '''Binary Compatibility:'''
  . First, we ensure binary compatibility for applications that use the old '''mapred''' APIs. This means that applications built against the MRv1 '''mapred''' APIs can run directly on YARN without recompilation, merely by pointing them at an Apache Hadoop 2.x cluster via configuration (see the sketch below).
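
  . For illustration, here is a hedged word-count sketch written purely against the old '''org.apache.hadoop.mapred''' API. The class is hypothetical, not taken from Nutch; the point is that a jar compiled against MRv1 containing code like this should run on a YARN cluster without recompilation.

{{{#!java
// Hypothetical MRv1-style job (old "mapred" API), not Nutch code.
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class OldApiWordCount {

  // Old-API mapper: tokenize each line and emit (word, 1).
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      for (String w : value.toString().split("\\s+")) {
        if (!w.isEmpty()) out.collect(new Text(w), new IntWritable(1));
      }
    }
  }

  // Old-API reducer: sum the counts for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) sum += values.next().get();
      out.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(OldApiWordCount.class);
    conf.setJobName("old-api-wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf); // binary compatible with YARN per the text above
  }
}
}}}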

  . '''Source Compatibility'''

  . One cannot ensure complete binary compatibility for applications that use the '''mapreduce''' APIs, as these APIs have evolved a lot since MRv1. However, source compatibility is ensured for the '''mapreduce''' APIs that break binary compatibility. In other words, users should recompile applications that use the '''mapreduce''' APIs against the MRv2 jars.

  . One notable binary incompatibility is '''Counter''', which affects the Nutch class !CrawlDbUpdateUtil in package '''crawl''' (i.e. '''crawl/CrawlDbUpdateUtil.java''').
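
  . A hedged illustration of why this breaks: in MRv1 '''org.apache.hadoop.mapreduce.Counter''' is a class, while in MRv2 it is an interface. Source like the hypothetical mapper below compiles against both, but a binary compiled against MRv1 fails at runtime on MRv2 (with an '''IncompatibleClassChangeError'''), hence the need to recompile.

{{{#!java
// Hypothetical mapper (not from Nutch) showing source-compatible Counter use.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Mapper;

public class CounterSketch extends Mapper<LongWritable, Text, Text, LongWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    // The same source works on MRv1 and MRv2; only the compiled binary
    // differs, because Counter changed from a class to an interface.
    Counter records = ctx.getCounter("sketch", "records");
    records.increment(1);
  }
}
}}}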

  . '''Tradeoffs between MRv1 Users and MRv2 Adopters '''

  . Unfortunately, maintaining binary compatibility for MRv1 applications may lead to binary incompatibility issues for early MRv2 adopters. Below is the list of !MapReduce APIs which are incompatible with Hadoop 1.3.
