Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2014/06/20 05:22:56 UTC

[Nutch Wiki] Trivial Update of "FirstReport" by LewisJohnMcgibbney

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "FirstReport" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/FirstReport

New page:
= Google Summer of Code 2014 Report 1 =

'''Project Name''': NUTCH-841 Create a Wicket-based Web Application for Nutch 2.X
'''Report date''': 19th June 2014
'''Student Name''': Fjodor Vershinin 
'''Mentor Name''': Lewis John McGibbney (lewismc)

== Project description == 
The main goal of this project is to create an Apache Wicket-based web GUI for Apache Nutch 2.X.

== Review of Previous Actions ==
No actions are available as this is the first report.

==Objectives==
===Explore Nutch documentation===
First, I did some background research on Apache Nutch in the Russian-language segment of the Internet. Then I briefly reviewed the Nutch wiki, paying particular attention to the GUI specification page and the Nutch tutorials.

===Workspace setup===
The Nutch build is based on Ant and Ivy. I previously had experience only with Maven, so I spent some time on workspace setup. I also prefer Mercurial because of its very convenient MQ extension, which I really miss in SVN and Git. Fortunately, Mercurial has SVN and Git bridges, so it was not too troublesome to configure. As a benefit, I got a very convenient tool for working with patch queues.

===Contributions to Nutch community===
NUTCH-1731: I fixed an NPE bug, added the ability to stop the remote server, and refactored the code so that it uses the Apache commons-cli library for command-line parsing and automatic help message generation.

===Some experimental crawling===
Then I ran some test crawling sessions in order to understand Nutch mechanics and plugin configuration. Despite the very detailed wiki documentation, it took some time to get started because of the HBase, Elasticsearch, and Hadoop setup. Using this software stack I crawled some web pages and gained some experience as a Nutch user.

===Review previous GUI application===
I checked out the legacy GUI application from GitHub. Although it took some time to get it running, it was quite useful, because it showed me which features are most important and should be implemented first. From my point of view, the most important features are:
 1. Seed file upload
 2. A component for changing the default Nutch settings
 3. Search in crawling results
 4. Crawling statistics
 5. Authentication
 6. Instance management
 7. Plugin support

===Create HTML prototype===
I created an HTML prototype to experiment with Twitter Bootstrap and, as a final result, to import the legacy application's HTML pages into my project. Wicket makes it possible to use such prototype pages directly, just by adding wicket:id attributes.

===Nutch REST API research===
This was my first serious encounter with the Nutch code base. I decided that the REST API would be a good entry point, because it would help me understand which features are already implemented on the server side, what drawbacks exist, and how the application works. It was quite useful, although it took quite a lot of time because of the code's complexity and coupling. After this step I had a better understanding of Nutch's internal architecture and source code.

===Refactoring of Nutch REST API===
After reviewing this API I realized that the code required refactoring. Most of it was written two years ago using the plain Restlet library. We decided that I would use the JAX-RS interface, because it is the standard for modern REST services. Restlet has been kept as the back end, but it can now be replaced without much effort. The refactoring has also reduced code coupling and complexity. These changes were proposed as a patch in the NUTCH-1769 issue. I would also like to provide unit and integration tests for the REST API, but this is outside the GSoC scope.

===Implement application skeleton===
This was the most time-consuming part. I spent some time exploring technologies such as embedded Jetty and Wicket-Spring integration. Then I asked a professional Java web developer for advice on how to set up the initial application skeleton. He gave me some Wicket samples from his projects and pointed me to the right documentation about Wicket and Spring. Features implemented in the application skeleton:
 * RESTful client to Nutch API
 * GUI application configuration component
 * Base page structures
 * Navigation
 * Twitter Bootstrap support
 * Unit tests
 * Integration tests
 * Plugin support
(Mercurial repository: https://bitbucket.org/feodorv/uinutch)
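
To illustrate the "RESTful client to Nutch API" item above, here is a minimal sketch of how such a client might build its requests. It is not code from the uinutch repository. The class name, the base URL {{{http://localhost:8081}}}, the resource paths {{{admin}}} and {{{job/create}}}, and the JSON payload are all assumptions made for illustration, and the sketch uses the JDK's own {{{java.net.http}}} API (Java 11 or later) rather than whatever HTTP library the project actually uses.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

/** Hypothetical helper that builds HTTP requests for a Nutch REST server. */
public class NutchRestClientSketch {
    private final URI baseUri;

    public NutchRestClientSketch(String baseUrl) {
        // Ensure a trailing slash so that resolve() appends to the base path
        // instead of replacing its last segment.
        this.baseUri = URI.create(baseUrl.endsWith("/") ? baseUrl : baseUrl + "/");
    }

    /** GET request for the server status; the "admin" path is an assumption. */
    public HttpRequest statusRequest() {
        return HttpRequest.newBuilder(baseUri.resolve("admin"))
                .timeout(Duration.ofSeconds(10))
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    /** POST request submitting a JSON job description; path and payload shape are assumptions. */
    public HttpRequest createJobRequest(String jsonBody) {
        return HttpRequest.newBuilder(baseUri.resolve("job/create"))
                .timeout(Duration.ofSeconds(10))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) {
        NutchRestClientSketch client = new NutchRestClientSketch("http://localhost:8081");
        HttpRequest status = client.statusRequest();
        // Only prints the request line; nothing is sent over the network here.
        System.out.println(status.method() + " " + status.uri());
    }
}
```

The sketch only constructs requests, so it runs without a Nutch server. To actually send one, the request could be passed to {{{HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())}}}. Keeping request construction separate from sending also makes this part of a client easy to cover with unit tests.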

===Future Actions===