Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2016/05/22 21:08:26 UTC
[Nutch Wiki] Update of "GoogleSummerOfCode/SecurityLayer" by kamaci
The "GoogleSummerOfCode/SecurityLayer" page has been changed by kamaci:
https://wiki.apache.org/nutch/GoogleSummerOfCode/SecurityLayer
New page:
<<TableOfContents(4)>>
||'''Title :'''||||[[https://summerofcode.withgoogle.com/projects/#6099177868099584|GSOC 2016 Proposal]]||
||'''Issue :'''|||| [[https://issues.apache.org/jira/browse/NUTCH-1756|NUTCH-1756 - Security layer for NutchServer]]||
||'''Student :'''||||Furkan KAMACI - furkankamaci [at] gmail.com||
||'''Mentor :'''||||[[https://wiki.apache.org/nutch/LewisJohnMcgibbney|Lewis John McGibbney]]||
=== Abstract ===
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache
Lucene, the project has diversified and now comprises two codebases: Nutch 1.x and Nutch 2.x.
This proposal aims to develop a security layer for Nutch 2.x.
=== Additional Info ===
http://www.furkankamaci.com/
Furkan KAMACI
Istanbul Technical University
Graduate School of Science, Engineering and Technology
Computer Engineering
Istanbul, Turkey
=== 1. Introduction ===
Apache Nutch comprises two codebases:
Nutch 1.x: A mature, production-ready crawler. 1.x enables fine-grained configuration and relies on Apache Hadoop data
structures, which are well suited to batch processing.
Nutch 2.x: An emerging alternative that takes direct inspiration from 1.x but differs in one key area: storage is abstracted
away from any specific underlying data store by using Apache Gora to handle object-to-persistence mappings. This makes it
possible to implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text,
outlinks, inlinks, etc.) in a number of NoSQL storage solutions.
=== 2. Definition of the Problem ===
Nutch 2.x has a REST API and a web application, but neither is protected by a security layer. A security layer should be
implemented that provides authentication and authorization and supports several authentication mechanisms. The work also
includes documenting the API and refactoring existing code. This project therefore proposes the design, development and
implementation of the security layer described above. The work applies specifically to the Nutch 2.x codebase.
=== 3. Proposed Method ===
==== A. Background ====
Nutch already provides a REST API for interacting with it. Administration and configuration tasks can be performed
through this API.
This proposal offers a comprehensive security layer under NUTCH-1756. Existing code should be refactored, a security layer
should be added, and API documentation should be generated programmatically (e.g. with Miredot or Swagger).
==== B. Suggested Steps for Proposed Method ====
1) Authentication/Authorization Implementation
2) API Documentation Generator Implementation
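As a rough illustration of step 1, the sketch below shows how a request to a REST endpoint might first be authenticated with HTTP Basic credentials and then authorized by role. It is written in Python for brevity and is not NutchServer code: the user store `USERS`, the rule table `RULES`, the paths, and the functions `authenticate`, `authorize` and `handle` are all hypothetical names chosen for this example.

```python
import base64
import hashlib
import hmac

# Hypothetical in-memory user store: username -> (SHA-256 hex of password, role).
# Assumption for the sketch only; a real deployment would use salted, slow
# password hashing and an external credential store.
USERS = {
    "admin": (hashlib.sha256(b"admin-secret").hexdigest(), "ADMIN"),
    "viewer": (hashlib.sha256(b"viewer-secret").hexdigest(), "USER"),
}

# Hypothetical access rules: (HTTP method, path prefix, roles allowed).
RULES = [
    ("GET", "/admin", {"ADMIN", "USER"}),
    ("POST", "/job", {"ADMIN"}),
]


def authenticate(authorization_header):
    """Return the caller's role for a valid 'Basic ...' header, else None."""
    if not authorization_header or not authorization_header.startswith("Basic "):
        return None
    try:
        raw = base64.b64decode(authorization_header[len("Basic "):], validate=True)
        decoded = raw.decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return None  # malformed base64 or non-UTF-8 payload
    username, sep, password = decoded.partition(":")
    if not sep or username not in USERS:
        return None
    expected_hash, role = USERS[username]
    actual_hash = hashlib.sha256(password.encode("utf-8")).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return role if hmac.compare_digest(expected_hash, actual_hash) else None


def authorize(role, method, path):
    """Return True if the role may call method on path; deny by default."""
    for rule_method, prefix, allowed_roles in RULES:
        if method == rule_method and path.startswith(prefix):
            return role in allowed_roles
    return False


def handle(method, path, authorization_header):
    """Return the HTTP status a secured endpoint would answer with."""
    role = authenticate(authorization_header)
    if role is None:
        return 401  # not authenticated
    if not authorize(role, method, path):
        return 403  # authenticated, but not permitted
    return 200
```

Keeping authentication (who is calling) separate from authorization (what the caller may do) lets further mechanisms replace `authenticate` without touching the access rules.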
=== 4. Schedule & Timeline ===
Suggested schedule and timeline is as follows:
1) ''Analyzing the Problem (1 week, 30 May 2016)''
a) The problem will be analyzed in more detail.
2) ''Authentication Implementation (5 weeks, 4 July 2016)''
a) HTTP Basic authentication
b) HTTP Digest authentication
c) SSL client authentication
d) Kerberos authentication
3) ''Authorization Implementation (3 weeks, 25 July 2016)''
a) Authorization will be implemented.
4) ''API Documentation (1 week, 1 August 2016)''
a) API documentation generation will be implemented.
5) ''Test (1 week, 8 August 2016)''
a) Implementation tests will be written and run.
6) ''Documentation (1 week, 15 August 2016)''
a) Documentation will be prepared.
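Of the mechanisms listed under step 2, HTTP Digest differs from HTTP Basic in that the password itself is never sent: the client sends a hash over the credentials, the server nonce and the request. The Python sketch below computes the expected response for the MD5 `qop=auth` variant described in RFC 2617. The function names are illustrative assumptions, and nonce generation, replay protection and newer hash algorithms are left out.

```python
import hashlib
import hmac


def _md5_hex(text):
    """Hex-encoded MD5 of a UTF-8 string, as used by RFC 2617 Digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username, realm, password, method, uri, nonce, nc, cnonce, qop="auth"):
    """Compute the Digest 'response' value for qop=auth."""
    ha1 = _md5_hex(f"{username}:{realm}:{password}")  # secret part
    ha2 = _md5_hex(f"{method}:{uri}")                 # request part
    return _md5_hex(f"{ha1}:{nonce}:{nc}:{cnonce}:{qop}:{ha2}")


def verify(client_response, **params):
    """Compare the client's response with the server-side computation."""
    return hmac.compare_digest(client_response, digest_response(**params))
```

A server would call `verify` with the fields parsed from the `Authorization: Digest ...` header plus the password it holds for that user, and answer 401 with a fresh nonce when the check fails.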
=== References ===
[1] https://issues.apache.org/jira/browse/NUTCH-1756
[2] https://wiki.apache.org/nutch/NutchRESTAPI
[3] https://issues.apache.org/jira/browse/GORA-386
[4] https://issues.apache.org/jira/browse/NUTCH-2243
[5] https://issues.apache.org/jira/browse/NUTCH-2022
[6] https://github.com/apache/nutch
[7] https://en.wikipedia.org/wiki/Apache_Nutch
[8] https://github.com/apache/gora
[9] http://en.wikipedia.org/wiki/Apache_Gora
[10] http://gora.apache.org/current/tutorial.html#introduction
=== Reports ===
[[https://wiki.apache.org/nutch/GoogleSummerOfCode/SecurityLayer/WeeklyReports|Weekly Reports]]
[[https://wiki.apache.org/nutch/GoogleSummerOfCode/SecurityLayer/MidtermReport|Midterm Report]]
[[https://wiki.apache.org/nutch/GoogleSummerOfCode/SecurityLayer/FinalReport|Final Report]]