Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2015/04/14 12:43:30 UTC

[Nutch Wiki] Update of "SumanSaurabh/GSoC2015Nutch" by SumanSaurabh

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "SumanSaurabh/GSoC2015Nutch" page has been changed by SumanSaurabh:
https://wiki.apache.org/nutch/SumanSaurabh/GSoC2015Nutch?action=diff&rev1=2&rev2=3

   .
   . '''1.2) Workspace Setup:'''
  
-  . Nutch  workspace it built on Ant+Ivy. I have experience with Ant build  framework, so workspace setup would be relatively easier. I have forked  the Nutch codebase to my Git '''[2]''' and after successful completion I will  provide the patch. Meanwhile I will also try to resolve issues mentioned  in Nutch Jira.
+  . The Nutch workspace is built on Ant+Ivy. I have experience with the Ant build framework, so workspace setup should be relatively easy. I have forked the Nutch codebase to my Git repository '''[2]''' and will provide the patch after successful completion.
+  Nutch's dependency on Hadoop: ''hadoop-core.1.x.jar'' is replaced by separate artifacts in ''Hadoop 2.x''. The current dependency is:
+  . {{{
+ <dependency org="org.apache.hadoop" name="hadoop-core" rev="1.2.0" conf="*->default">
+    <exclude org="hsqldb" name="hsqldb" />
+    <exclude org="net.sf.kosmosfs" name="kfs" />
+    <exclude org="net.java.dev.jets3t" name="jets3t" />
+    <exclude org="org.eclipse.jdt" name="core" />
+    <exclude org="org.mortbay.jetty" name="jsp-*" />
+    <exclude org="ant" name="ant" />
+ </dependency>
+ }}}
  
+  . The following dependencies need to be added instead of the above for Hadoop 2.6 support.
+  . {{{
+ <dependency org="org.apache.hadoop" name="hadoop-common" rev="2.6.0" conf="*->default" />
+ <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core" rev="2.6.0" conf="*->default" />
+ }}}
   .
   . '''1.3) Experimental setup of Nutch with Hadoop and its result:'''
  
   . I have been using Hadoop 2.3 for my !MapReduce application, and while trying to set up Nutch 1.9 with Hadoop 2.3 I ran into the following error:
- 
+  . {{{
-  . Injector:
+ Injector:
-  . java.lang.!UnsupportedOperationException: Not implemented by the !DistributedFileSystem !FileSystem                implementation
+   java.lang.UnsupportedOperationException: Not implemented by the DistributedFileSystem FileSystem implementation
-   . at org.apache.hadoop.fs.!FileSystem.getScheme(!FileSystem.java:214)
+   at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:214)
- 
-   . at org.apache.hadoop.fs.!FileSystem.loadFileSystems(!FileSystem.java:2365)
+   at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2365)
+   at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
+   at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
- 
-   . at org.apache.hadoop.fs.!FileSystem.getFileSystemClass(!FileSystem.java:2375) at org.apache.hadoop.fs.!FileSystem.createFileSystem(!FileSystem.java:2392)
- 
-   . at org.apache.hadoop.fs.!FileSystem.access$200(!FileSystem.java:89)
+   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
+   at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
+   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
- 
-   . at org.apache.hadoop.fs.!FileSystem$Cache.getInternal(!FileSystem.java:2431) at org.apache.hadoop.fs.!FileSystem$Cache.get(!FileSystem.java:2413)
- 
-   . at org.apache.hadoop.fs.!FileSystem.get(!FileSystem.java:368)
+   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
- 
-   . at org.apache.hadoop.fs.!FileSystem.get(!FileSystem.java:167)
+   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
- 
-   . at org.apache.nutch.crawl.Injector.inject(Injector.java:297)
+   at org.apache.nutch.crawl.Injector.inject(Injector.java:297)
-   . at org.apache.nutch.crawl.Injector.run(Injector.java:380)
+   at org.apache.nutch.crawl.Injector.run(Injector.java:380)
-   . at org.apache.hadoop.util.!ToolRunner.run(!ToolRunner.java:70)
+   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
- 
-   . at org.apache.nutch.crawl.Injector.main(Injector.java:370) .
+   at org.apache.nutch.crawl.Injector.main(Injector.java:370)
- 
+ }}}
   . Maybe I should start looking from this point onwards?
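
One plausible reading of this trace, offered as an assumption rather than a confirmed diagnosis: the `FileSystem` base class on the classpath comes from Hadoop 2.x, where `getScheme()` throws by default, while the `DistributedFileSystem` that gets loaded comes from an older jar that does not override it. The sketch below reproduces only that mechanism. `BaseFs`, `LegacyFs` and `CurrentFs` are hypothetical stand-ins, not Hadoop classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SchemeDemo {
    /** Stand-in for a base class whose newer version added getScheme(). */
    static abstract class BaseFs {
        String getScheme() {
            throw new UnsupportedOperationException(
                "Not implemented by the " + getClass().getSimpleName()
                    + " FileSystem implementation");
        }
    }

    /** Written before getScheme() existed, so it inherits the throwing default. */
    static class LegacyFs extends BaseFs { }

    /** Written against the newer base class and overrides getScheme(). */
    static class CurrentFs extends BaseFs {
        @Override
        String getScheme() {
            return "hdfs";
        }
    }

    /** Asks every candidate for its scheme and records successes and failures. */
    static List<String> register(List<BaseFs> candidates) {
        List<String> report = new ArrayList<>();
        for (BaseFs fs : candidates) {
            try {
                report.add("ok: " + fs.getScheme());
            } catch (UnsupportedOperationException e) {
                report.add("failed: " + e.getMessage());
            }
        }
        return report;
    }

    public static void main(String[] args) {
        List<BaseFs> candidates = Arrays.asList(new CurrentFs(), new LegacyFs());
        for (String line : register(candidates)) {
            System.out.println(line);
        }
    }
}
```

If this reading is right, the fix lies in the classpath (all Hadoop jars from the same 2.x release) rather than in the Nutch `Injector` code itself.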
  
  == Phase 2 (Coding): ==
-  . 2.1) Migrating from Hadoop 1.x to Hadoop 2.x
+  . '''2.1) Migrating from Hadoop 1.x to Hadoop 2.x'''
    . '''Binary Compatibility:'''
  
    . First, we ensure binary compatibility to the applications that use old '''mapred''' APIs. This means that applications which were built against MRv1 '''mapred''' APIs can run directly on YARN without recompilation, merely by pointing them to an Apache Hadoop 2.x cluster via configuration.
@@ -139, +149 @@

  
    . '''Source Compatibility:'''
  
-   . One cannot ensure complete binary compatibility with the applications that use '''mapreduce''' APIs, as these APIs have evolved a lot since MRv1. However, we ensure source compatibility for '''mapreduce''' APIs that break binary compatibility. In other words, users should recompile their applications that use '''mapreduce''' APIs against MRv2 jars. One notable binary incompatibility break is '''Counter''' in
+   . One cannot ensure complete binary compatibility for applications that use the '''mapreduce''' APIs, as these APIs have evolved a lot since MRv1. In other words, users should recompile applications that use the '''mapreduce''' APIs against the MRv2 jars. One notable binary incompatibility is '''Counter''' in
  
+   .{{{
+ Package: crawl
-   . <<BR>>
- 
-   . Package: '''crawl '''
- 
-   . <<BR>>
- 
-   . Class: '''!CrawlDbUpdateUtil '''
+ Class: CrawlDbUpdateUtil
+ }}}
- 
-   . <<BR>>
- 
-   . i.e. '''crawl/CrawlDbUpdateUtil.java''' .
- 
-   . <<BR>>
     .
    '''Tradeoffs between MRv1 Users and MRv2 Adopters:'''
  
-   . ''' ''' Unfortunately, maintaining binary compatibility for MRv1 applications  may lead to binary incompatibility issues for early MRv2 adopters.  Below is the  list of MapReduce APIs which are incompatible with Hadoop 1.3.
+   . Unfortunately, maintaining binary compatibility for MRv1 applications may lead to binary incompatibility issues for early MRv2 adopters. Below is the list of !MapReduce APIs which are incompatible with Hadoop 0.23.
  
-   . <<BR>>
+   .{{{
-    * '''''org.apache.hadoop.mapreduce.Job'''''#''failTask'' <--> Return type changes from void to boolean
+ org.apache.hadoop.mapreduce.Job#failTask <--> Return type changes from void to boolean
-    * '''''org.apache.hadoop.mapreduce.Job'''''#''killTask'' <--> Return type changes from void to boolean
+ org.apache.hadoop.mapreduce.Job#killTask <--> Return type changes from void to boolean
-    * '''''org.apache.hadoop.mapreduce.Job'''''#''getTaskCompletionEvents'' <--> Return type changes from o.a.h.mapred.!TaskCompletionEvent to o.a.h.mapreduce.!TaskCompletionEvent
+ org.apache.hadoop.mapreduce.Job#getTaskCompletionEvents <--> Return type changes from o.a.h.mapred.TaskCompletionEvent to o.a.h.mapreduce.TaskCompletionEvent
+ }}}
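
Why a changed return type breaks already compiled callers but not their source: the JVM links a call by method name plus the full descriptor, and the descriptor includes the return type. The sketch below prints the two descriptors side by side. It is only an illustration: the single `String` parameter is an assumed stand-in for the real argument type of `killTask`, and `DescriptorDemo` is a hypothetical class.

```java
import java.lang.invoke.MethodType;

public class DescriptorDemo {
    /** JVM descriptor of a method taking one String and returning returnType. */
    static String descriptor(Class<?> returnType) {
        return MethodType.methodType(returnType, String.class)
                         .toMethodDescriptorString();
    }

    public static void main(String[] args) {
        // Bytecode compiled against the void variant asks for the first
        // descriptor; a class that only offers the boolean variant cannot
        // satisfy that lookup until the caller is recompiled.
        System.out.println("old: killTask" + descriptor(void.class));
        System.out.println("new: killTask" + descriptor(boolean.class));
    }
}
```

A caller that ignores the return value compiles unchanged against either variant, which is why recompiling is enough for these three methods.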
  
  <<BR>>
  
   . '''2.2) Configuring and Running MRv2 Clusters'''
  
-  . '''      Configuration Migration'''
+  .        '''Configuration Migration'''
    . Since !MapReduce 1 functionality has been split into two components, !MapReduce cluster configuration options have been split into YARN configuration options, which go in yarn-site.xml, and !MapReduce configuration options, which go in mapred-site.xml.
  
    .
@@ -177, +179 @@

  
    .
  
-   . A minimal configuration required to run MRv2 jobs on YARN is '''yarn-site.xml configuration''', '''mapred-site.xml configuration. '''Details present in the link [3].
+   . The minimal configuration required to run MRv2 jobs on YARN is the '''yarn-site.xml configuration'''. The detailed configuration for ''conf/mapred-site.xml'', ''conf/core-site.xml'' and ''conf/hdfs-site.xml'' will remain the same.
+   . Configuration for ''conf/yarn-site.xml'' is:
+   .{{{
+ <configuration>
+     <property>
+         <name>yarn.nodemanager.aux-services</name>
+         <value>mapreduce_shuffle</value>
+     </property>
+ </configuration>
+ }}}
  
    . <<BR>>
   '''2.3) Summary of Configuration Changes which I have observed'''
  
-  . '''1) !JobTracker Properties and !ResourceManager Equivalents'' '''''
+  . '''1) !JobTracker Properties and !ResourceManager Equivalents'''
-   . <<BR>>
+  .{{{
-    * '''''mapred.job.tracker''''' to '''''yarn.resourcemanager.hostname'''''
+ mapred.job.tracker to yarn.resourcemanager.hostname
-     . Package: '''crawl'''
-     . Class: '''Generator''' <<BR>>
+ Package: crawl
+ Class: Generator
+ }}}
  
-  . 2) MRv1 Properties that have no MRv2 Equivalents'''''  '''''
+  . '''2) MRv1 Properties that have no MRv2 Equivalents'''
-   * '''''mapred.temp.dir''''' has no MRV2 eqivalaent<<BR>>Package: '''crawl''' <<BR>>Class: '''!DeduplicationJob'''
+  .{{{
+ mapred.temp.dir has no MRv2 equivalent
+ Package: crawl
+ Class: DeduplicationJob
+ }}}
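
A minimal sketch of how the first mapping above could be applied to an existing set of properties. It assumes that `mapred.job.tracker` holds either `host:port` or the special value `local`, and that only the host part is carried over. `PropertyMigrationDemo` and `migrate` are hypothetical names, not part of Nutch or Hadoop.

```java
import java.util.Properties;

public class PropertyMigrationDemo {
    static final String OLD_KEY = "mapred.job.tracker";
    static final String NEW_KEY = "yarn.resourcemanager.hostname";

    /** Returns a copy of props with OLD_KEY replaced by NEW_KEY (host part only). */
    static Properties migrate(Properties props) {
        Properties out = new Properties();
        out.putAll(props);
        String tracker = out.getProperty(OLD_KEY);
        out.remove(OLD_KEY);
        // "local" means no cluster, so there is no hostname to carry over.
        if (tracker == null || tracker.equals("local") || out.containsKey(NEW_KEY)) {
            return out;
        }
        int colon = tracker.lastIndexOf(':');
        String host = colon >= 0 ? tracker.substring(0, colon) : tracker;
        out.setProperty(NEW_KEY, host);
        return out;
    }

    public static void main(String[] args) {
        Properties old = new Properties();
        old.setProperty(OLD_KEY, "master:9001");
        System.out.println(migrate(old));
    }
}
```

Properties without an MRv2 equivalent, such as `mapred.temp.dir`, are deliberately left untouched here; code that reads them has to be reviewed by hand.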
  
  == Phase 3 (Documentation): ==
   1. Documentation giving a detailed description of the migration of the Hadoop framework in Apache Nutch.
@@ -237, +253 @@

   . <<BR>>''' '''
  
  = References: =
-  . [1] http://wiki.apache.org/nutch/FrontPage
+  . [1] http://wiki.apache.org/nutch/FrontPage<<BR>>
- 
-  . <<BR>> [2] https://github.com/sumansaurabh/nutch <<BR>>
+  . [2] https://github.com/sumansaurabh/nutch <<BR>>
- 
   . [3] https://sites.google.com/site/nutch1936/home/3-methodology <<BR>>
- 
-  . [4] http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_mapreduce_to_yarn_migrate.html
+  . [4] http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_mapreduce_to_yarn_migrate.html<<BR>>
- 
-  . <<BR>>
- 
-  . [5] [[http://www.slideshare.net/wattsteve/web-crawling-and-data-gathering-with-apache-nutch?related=2|http://www.slideshare.net/tshooter/strata-conf2014]]
+  . [5] [[http://www.slideshare.net/wattsteve/web-crawling-and-data-gathering-with-apache-nutch?related=2|http://www.slideshare.net/tshooter/strata-conf2014]]<<BR>>
- 
-  . <<BR>>
- 
-  . [6] http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html
+  . [6] http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html<<BR>>
- 
-  . <<BR>>
- 
   . [7] http://www.slideshare.net/wattsteve/web-crawling-and-data-gathering-with-apache-nutch?related=2
  
-  .
-  .
-