You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2015/03/29 20:07:07 UTC

[Nutch Wiki] Trivial Update of "MohitBagde" by MohitBagde

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "MohitBagde" page has been changed by MohitBagde:
https://wiki.apache.org/nutch/MohitBagde

New page:
##language:en
MohitBagde/Gsoc2015ProjectProposal

'''GSoc 2015 – Project Proposal for “NUTCH-1936''' -''' GSoC 2015 - Move Nutch to Hadoop 2.X”'''

'''Name''': Mohit Bagde (Email: <<MailTo(bagde AT usc DOT edu)>>)

'''University''': University of Southern California

'''Mentor Name: '''Lewis John McGibbney

'''Abstract: '''The main goal of this project is the porting of the entire Apache Trunk codebase to the new Hadoop 2.X API [1]. It will involve the complete overhaul of all the main jobs involved in a Nutch crawler viz. Injecting, Fetching, Parsing, De-duplication and Indexing to the new Hadoop MRv2 API from the current Hadoop 1.X codebase.

'''Content: '''

'''Motivation for the Project'''

My name is Mohit Bagde.  I am a Graduate student currently pursuing my Master’s degree in Computer Science at the University of Southern California. I am strongly self-motivated because of my interest in this field and to do valuable, guided research in Data Informatics, Information Retrieval and Database Systems. I am aware that R group and Google expect very high standards from its students. On my part, I can assure you of hard work and consistency. I believe that my enthusiasm will enable me to meet those expectations.

In my current semester, I have taken CS572 Information Retrieval and Search Engines under Prof. Chris Mattmann and have worked on Nutch 1.X [1] as part of the first assignment which involved crawling with Nutch and integrating with Tika and subsequently developing a plugin in Nutch. I have also taken INF 550 under Prof. Seon Kim where I have written programs in HDFS using Map Reduce and I find that both these subjects have a common point in the JIRA issue NUTCH-1936 [2] which is about porting Nutch to Hadoop 2.X. I enjoyed working with Nutch and found the entire experience to be very knowledgeable. I would like to continue to develop and contribute to Nutch in any which way possible.

'''Apache Nutch – About and Workflow'''

Apache Nutch is a highly scalable and robust web crawler that is also extremely polite and obeys the rules of robots.txt file for the websites that it crawls [3]. Nutch, developed by Doug Cutting (who also created Lucene and Hadoop), now has two separate codebases namely the 1.X and 2.X. Although Nutch is written in Java, it makes use of various “plugin” like modules that allow developers to implement their own parsers, deduplication algorithms and indexer interfaces.

Apache Hadoop was an Incubator sub-project [4] that was derived from Apache Nutch as Nutch required significant processing power to perform multi-machine web crawling and indexing. This came about in the form of !MapReduce tasks and the HDFS system. Nutch runs on a Hadoop cluster that scales well to the order of ~ 100 machines. However, a user can run the Nutch configuration on a local machine by configuring Hadoop to run in standalone or Pseudo-distributed mode [5] and thus achieve a comparably sized processing power as that of running Hadoop over a cluster of machines.

The primary advantage of Apache Nutch lies in its customizable pluggable interfaces or “plugins” [6] as they are termed. Below is a fairly simple architecture of how Nutch performs its crawling and indexing.

{{http://www.atlantbh.com/wp-content/uploads/2012/03/Apache-Nutch-Flowchart-e1331646794565.png||height="348",width="678"}}

'''Figure 1 – Apache Nutch Flowchart [7]'''

Having done a few simple crawls, the overall process is as follows:

1.       Initially, we create an empty directory containing the files that will form the seed URLs. (URL files). Then we run the inject command to inject the seed URL into the crawlDB. The crawlDB stores meta-data about the crawled URLs. (URL DB)

2.       Then, the generate command is run to create segments that contain a list of URLs that have been successfully crawled. (db_success flag will be set for these files in the crawlDB) (SEGMENT)

3.       The fetcher command will acquire the content of those URLs on the fetchlist and store it in the segment directory created by generate. This step takes anywhere between a few minutes (2-3 rounds of crawling) to days (30-40 rounds of crawling) depending on the number of rounds that the crawler is run for. Crawls can be optimized by modifying certain parameters in the nutch-site.xml that overrides the nutch-default config. [8] (CONTENT)

4.       Nutch is integrated with Apache Tika which is a general framework of parsers [9] to extract the content and resulting metadata from the URLs that have been fetched. Subsequently, the parse command is run to parse the content from the websites that have been fetched. It can also be run to update the content of the crawlDB during a re-crawl. (Nutch re-crawling can be done to update the crawlDB to incorporate changes made to data of the crawled URLs). Also, HTMLParser removes all <html> tags from the documents that are fetched. (PARSED_DATA)

5.       Finally, before indexing the parsed_data with Apache Solr, Nutch will perform an inversion of the links. This is due to the following paradigm that “it is not of interest to account for the number of outgoing links, instead, we should account for the number of inbound links”. This is quite similar to how Google PageRank works and is important for the scoring function. The inverted links are saved in the linkdb. The linkdb also stores information about all links known to each URL fetched by Nutch. (INVERTLINK DB)

6.       The last step is optional in Nutch as of version 1.10 trunk. One can index the final 6 directories that are created (as shown in the directory structure). The SolrIndex can be used along with Lucene’s library. And after having installed Solr [10], one can go to the localhost:8983/solr/admin (if on jetty) and query on the data that they have crawled via Nutch.

'''Apache Hadoop – 1.X vs 2.X Issues'''

The current version of Apache Hadoop is 2.6. At the time of implementation of Apache Nutch, however it was running Hadoop 1.X. The main goal of this project is to migrate Hadoop 1.X to 2.X. Before moving on to how we migrate these changes in Nutch, we must first understand what the key differences are between 1.X and 2.X and what changes must be incorporated to ensure that their both binary and source compatibility between the two version for the Nutch trunk.
||<tablestyle="border-collapse:collapse;border:none;mso-border-alt:solid windowtext .5pt;mso-yfti-tbllook:1184;mso-padding-alt:0in 5.4pt 0in 5.4pt" tableclass="MsoTableGrid"rowstyle="mso-yfti-irow:0;mso-yfti-firstrow:yes"width="151px" style="border:solid windowtext 1.0pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Difference''' ||<width="216px" style="border:solid windowtext 1.0pt;border-left:none;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Hadoop 1.X''' ||<width="271px" style="border:solid windowtext 1.0pt;border-left:none;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Hadoop 2.X''' ||
||<rowstyle="mso-yfti-irow:1"width="151px" style="border:solid windowtext 1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Number of nodes''' ||<width="216px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">~4,000 nodes per cluster ||<width="271px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">~10,000 nodes per cluster ||
||<rowstyle="mso-yfti-irow:2"width="151px" style="border:solid windowtext 1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Running Time''' ||<width="216px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">O(#nodes in cluster) ||<width="271px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">O(cluster size) ||
||<rowstyle="mso-yfti-irow:3"width="151px" style="border:solid windowtext 1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Namespace Config''' ||<width="216px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">Only 1 namespace node ||<width="271px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">Multiple namespaces for managing HDFS ||
||<rowstyle="mso-yfti-irow:4"width="151px" style="border:solid windowtext 1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Application support''' ||<width="216px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">Only able to run Map and reduce jobs, that are static ||<width="271px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">Able to run any java apps that can integrate with Hadoop ||
||<rowstyle="mso-yfti-irow:5;mso-yfti-lastrow:yes"width="151px" style="border:solid windowtext 1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Efficiency''' ||<width="216px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">Bottleneck lies in the JobTracker    for both resource management and taskTracker task scheduling ||<width="271px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">Uses YARN (Yet Another Resource Negotiator) to perform effective   cluster management ||




 . '''Table 1.1 – Key difference in Hadoop 1.X and 2.X [11]'''

Although this table does not highlight all the differences between the two codebases, it is a good start to start exploring what changes must be made to Apache Nutch’s tasks to port it to 2.X. In Apache Hadoop 2.x the part that deals with resource management capabilities has been placed into Apache Hadoop YARN, a general purpose, distributed application management framework while Apache Hadoop MapReduce (aka MRv2) and it remains as a pure distributed computation framework.

So the crux of the project would be to ensure binary and source compatibility of the applications that use old '''mapred''' APIs in Nutch. For the case of binary compatibility, this means that applications which were built against MRv1 '''mapred''' APIs can run directly on YARN without recompilation, merely by pointing them to an Apache Hadoop 2.x cluster via configuration. However, we cannot ensure complete binary compatibility with the applications that use '''mapreduce''' APIs, as these APIs have evolved a lot since MRv1. However, we ensure source compatibility for '''mapreduce''' APIs that break binary compatibility. In other words, users should recompile their applications that use '''mapreduce''' APIs against MRv2 jars. One notable binary incompatibility break is Counter and CounterGroup.

In general, MRv2 is able to ensure satisfactory compatibility with MRv1 applications. However, due to some improvements and code re-factorings, a few APIs have been rendered backward-incompatible. The first quarter of the project will deal with identifying what APIs are being utilized by Nutch that have been rendered backward incompatible and discussing these issues as part of the NUTCH-1219 JIRA ticket [12]. I have already begun to look into this and will continue to do so as the project proceeds.

'''Project Plan'''

I believe that the project can be completed in a total of 4 quarters with each quarter being an increment over its predecessor. I propose the following plan for completing the project along with a timeline attached for each quarter, including deadlines and report submissions.
||<tablestyle="border-collapse:collapse;border:none;mso-border-alt:solid windowtext .5pt;mso-yfti-tbllook:1184;mso-padding-alt:0in 5.4pt 0in 5.4pt" tableclass="MsoTableGrid"rowstyle="mso-yfti-irow:0;mso-yfti-firstrow:yes"width="67px" style="border:solid windowtext 1.0pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Quarter''' ||<width="90px" style="border:solid windowtext 1.0pt;border-left:none;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Dates''' ||<width="481px" style="border:solid windowtext 1.0pt;border-left:none;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Content''' ||
||<rowstyle="mso-yfti-irow:1"width="67px" style="border:solid windowtext 1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Q1''' ||<width="90px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">03/27/2015 ||<width="481px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">Initial Project Proposal draft deadline ||
||<rowstyle="mso-yfti-irow:2"width="67px" style="border:solid windowtext 1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Q1''' ||<width="90px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">03/28/2015 to 04/15/2015 ||<width="481px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">Work done during this phase will mostly be   oriented towards studying the documentation of the Apache Nutch and Hadoop. I   will also start with drafting the first report at this time. This phase would   be a fairly long as I will have to read, understand and discuss the   documentation with my mentor on a regular basis. Various issues like 2.X binary   and source compatibilities, YARN configurations, identification of mapred   APIs issues in Nutch trunk to be addressed [13]. ||
||<rowstyle="mso-yfti-irow:3"width="67px" style="border:solid windowtext 1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Q1''' ||<width="90px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">04/16/2015 to 04/23/2015 ||<width="481px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">In this short period of time, I will work on   crawling with Nutch and identifying potential incompatibilities when it is   deployed over Hadoop. I have already used Nutch as part of my coursework in   college and I already know how to crawl, inject, fetch, parse and index   documents. I also have a Cloudera Distribution [14] (CDH) of Hadoop and HBase   along with other HDFS framework technologies (like Sqoop and Flume) and can   easily test the Map and Reduce tasks that are being used in Nutch at this   point. ||
||<rowstyle="mso-yfti-irow:4"width="67px" style="border:solid windowtext 1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Q1''' ||<width="90px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">04/24/2015 to 05/01/2015 ||<width="481px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">Study break as I have my final exams on 30^th^   April and 1^st^ May and will require some time to prepare   accordingly for it. ||
||<rowstyle="mso-yfti-irow:5"width="67px" style="border:solid windowtext 1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Q1''' ||<width="90px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">05/02/2015 to 05/22/2015 ||<width="481px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">I will submit a rough draft of the first report   to the mentor and discuss issues that have been solved and issues that have   not yet been tackled. Additional reading and resources that have to be   understood and related work done on this area will be addressed during this   time frame. Also, will begin work on trying to resolve JIRA issue NUTCH-1936   and 1219. ||
||<rowstyle="mso-yfti-irow:6"width="67px" style="border:solid windowtext 1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Q1''' ||<width="90px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">05/23/2015 ||<width="481px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">Deadline for the submission of First report ||
||<rowstyle="mso-yfti-irow:7"width="67px" style="border:solid windowtext 1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Q2''' ||<width="90px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">05/24/2015 to 06/10/2015 ||<width="481px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">Coding work to begin on the issues handled,   discussed with mentor in the Quarter 1 phase of the project. Will need to   keep track of any new bugs and fixes that pop up during this phase. Second   report rough draft to be completed. ||
||<rowstyle="mso-yfti-irow:8"width="67px" style="border:solid windowtext 1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Q2''' ||<width="90px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">06/11/2015 to 06/20/2015 ||<width="481px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">First half of coding work completion deadline   and discussion of second report with project mentor. Also, NUTCH-1219 to be   resolved by the end of this period. ||
||<rowstyle="mso-yfti-irow:9"width="67px" style="border:solid windowtext 1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Q3''' ||<width="90px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">06/21/2015 ||<width="481px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">Submission of Second report incorporating   NUTCH-1219 resolution ||
||<rowstyle="mso-yfti-irow:10"width="67px" style="border:solid windowtext 1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Q3''' ||<width="90px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">06/22/2015 to 07/02/2015 ||<width="481px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">Remainder of coding work and issues in second   report to be completed. Discussion with mentor on handling of the Mid-term   evaluation to be done. ||
||<rowstyle="mso-yfti-irow:11"width="67px" style="border:solid windowtext 1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Q3''' ||<width="90px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">07/03/2015 ||<width="481px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">Mid-term evaluation done by Google. ||
||<rowstyle="mso-yfti-irow:12"width="67px" style="border:solid windowtext 1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Q4''' ||<width="90px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">07/04/2015 to 07/23/2015 ||<width="481px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">Based on the evaluation, existing work to be   modified based on suggestions by mentor. Coding work second half completion   deadline to be addressed as well ||
||<rowstyle="mso-yfti-irow:13"width="67px" style="border:solid windowtext 1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Q4''' ||<width="90px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">07/24/2015 to 08/02/2015 ||<width="481px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">Testing work to be carried out on the system   during this phase of implementation. Existing code must be tested to handle   edge cases, exceptions and incompatibility issues addressed in the first and   second reports. ||
||<rowstyle="mso-yfti-irow:14"width="67px" style="border:solid windowtext 1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Q4''' ||<width="90px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">08/03/2015 to 08/10/2015 ||<width="481px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">Documentation to be performed for all phases step by step and discussed   with mentor. FAQ section and Bug fixes to be documented as well. Final   changes to be made that will increase brevity and readability. ||
||<rowstyle="mso-yfti-irow:15"width="67px" style="border:solid windowtext 1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Q4''' ||<width="90px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">08/11/2015 to 08/16/2015 ||<width="481px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">Proof-checking and submission of Final Report deadline preparation. Final   discussions with mentor for any loose ends and patches to be done to the   report. ||
||<rowstyle="mso-yfti-irow:16"width="67px" style="border:solid windowtext 1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Q4''' ||<width="90px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">08/17/2015 to 08/20/2015 ||<width="481px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">Suggested 'pencils down' date. Take a week to scrub code, write tests,   improve documentation, etc. ||
||<rowstyle="mso-yfti-irow:17;mso-yfti-lastrow:yes"width="67px" style="border:solid windowtext 1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">'''Q4''' ||<width="90px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">08/21/2015 ||<width="481px" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;vertical-align:top">Firm pencils down date and submission of report. ||




''' '''

'''References'''

[1]          https://issues.apache.org/jira/browse/NUTCH-1936

[2]          http://sunset.usc.edu/classes/cs572_2015/CS572_HW_NUTCH_POLAR.pdf

[3]          https://wiki.apache.org/nutch/

[4]          http://hadoop.apache.org/docs/stable/

[5]          https://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial

[6]          https://wiki.apache.org/nutch/AboutPlugins

[7]          http://www.atlantbh.com/wp-content/uploads/2012/03/Apache-Nutch-Flowchart-e1331646794565.png

[8]          https://wiki.apache.org/nutch/OptimizingCrawls

[9]          https://tika.apache.org/1.7/gettingstarted.html

[10]        http://lucene.apache.org/solr/quickstart.html

[11]        http://www.slideshare.net/RommelGarcia2/hadoop-1x-vs-2

[12]        https://issues.apache.org/jira/browse/NUTCH-1219

[13]        http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html

[14]        http://www.cloudera.com/content/cloudera/en/about/hadoop-and-big-data.html