Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2010/08/11 11:49:16 UTC

[jira] Created: (NUTCH-880) REST API (and webapp) for Nutch

REST API (and webapp) for Nutch
-------------------------------

Key: NUTCH-880
URL: https://issues.apache.org/jira/browse/NUTCH-880
Project: Nutch
Issue Type: New Feature
Affects Versions: 2.0
Reporter: Andrzej Bialecki

This issue is for discussing a REST-style API for accessing Nutch.

Here's an initial idea:

* I propose to use org.restlet for handling JSON requests
* hook up all regular tools so that they can be driven via this API. This would have to be an async API, since all Nutch operations take a long time to execute. It follows that we also need to be able to list running operations, retrieve their current status, and possibly abort/cancel/stop/suspend/resume/...? This also means that we would potentially have to create & manage many threads in a servlet - AFAIK this is frowned upon by J2EE purists...
* package this in a webapp (that includes all deps, essentially nutch.job content), with the restlet servlet as an entry point.
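
To make the async requirement above concrete, here is a minimal sketch of a job registry that starts work in the background and supports listing, status lookup and abort. It is not the API from the attached patches: the class name AsyncJobSketch, the State values and all method names are assumptions made up for illustration, and it uses a plain java.util.concurrent thread pool rather than anything Nutch- or Restlet-specific.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical async job registry; names are illustrative, not taken from the patch. */
class AsyncJobSketch {
  enum State { RUNNING, FINISHED, FAILED, KILLED }

  private final ExecutorService pool = Executors.newCachedThreadPool();
  private final Map<String, Future<?>> futures = new ConcurrentHashMap<>();
  private final Map<String, State> states = new ConcurrentHashMap<>();

  /** Starts the work in the background and returns a job id immediately. */
  String create(Runnable work) {
    String id = UUID.randomUUID().toString();
    states.put(id, State.RUNNING);
    futures.put(id, pool.submit(() -> {
      try {
        work.run();
        // Only a job that is still RUNNING may become FINISHED (it may have been killed).
        states.replace(id, State.RUNNING, State.FINISHED);
      } catch (RuntimeException e) {
        states.replace(id, State.RUNNING, State.FAILED);
      }
    }));
    return id;
  }

  /** Snapshot of all known jobs and their states. */
  Map<String, State> list() {
    return new HashMap<>(states);
  }

  /** State of one job, or null if the id is unknown. */
  State status(String id) {
    return states.get(id);
  }

  /** Marks a running job as KILLED and interrupts its thread; false if it was not running. */
  boolean abort(String id) {
    Future<?> f = futures.get(id);
    if (f == null || !states.replace(id, State.RUNNING, State.KILLED)) {
      return false;
    }
    f.cancel(true);
    return true;
  }

  /** Waits up to the timeout for the job to settle, then returns the recorded state. */
  State await(String id, long timeoutMillis) {
    Future<?> f = futures.get(id);
    if (f == null) {
      return null;
    }
    try {
      f.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (Exception e) {
      // Cancelled, timed out or interrupted: report whatever state was recorded.
      if (e instanceof InterruptedException) {
        Thread.currentThread().interrupt();
      }
    }
    return states.get(id);
  }

  void shutdown() {
    pool.shutdownNow();
  }

  public static void main(String[] args) {
    AsyncJobSketch jobs = new AsyncJobSketch();
    String id = jobs.create(() -> System.out.println("pretend this is a long crawl step"));
    System.out.println(id + " -> " + jobs.await(id, 5_000));
    jobs.shutdown();
  }
}
```

A REST layer could map create, list, status and abort onto separate resources. Whether the pool should live inside a servlet at all is exactly the J2EE concern raised above, and the sketch does not settle it.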

Open issues:

* how to implement the reading of crawl results via this API
* should we manage only crawls that use a single configuration per webapp, or should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops on them? This would be nice, because it would allow managing several different crawls, with different configs, in a single webapp, but it complicates the implementation a lot.
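
As a rough illustration of the second open issue, the sketch below keeps several named crawl configurations in one process and exposes create/read/update/delete on them. Everything here is hypothetical: CrawlContextSketch and its methods are invented for this example, configurations are modelled as plain string maps rather than Hadoop Configuration objects, and the property names used in main are only example values.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry of named crawl configurations ("crawl contexts"). */
class CrawlContextSketch {
  private final Map<String, Map<String, String>> configs = new ConcurrentHashMap<>();

  /** Creates a new context; returns false if the id is already taken. */
  boolean create(String id, Map<String, String> props) {
    return configs.putIfAbsent(id, new HashMap<>(props)) == null;
  }

  /** Read-only view of one context, or null if it does not exist. */
  Map<String, String> read(String id) {
    Map<String, String> c = configs.get(id);
    return c == null ? null : Collections.unmodifiableMap(c);
  }

  /** Sets one property on an existing context (copy-on-write); false if the id is unknown. */
  boolean update(String id, String key, String value) {
    return configs.computeIfPresent(id, (k, old) -> {
      Map<String, String> copy = new HashMap<>(old);
      copy.put(key, value);
      return copy;
    }) != null;
  }

  /** Removes a context; returns false if it did not exist. */
  boolean delete(String id) {
    return configs.remove(id) != null;
  }

  /** Sorted ids of all contexts. */
  Set<String> list() {
    return new TreeSet<>(configs.keySet());
  }

  public static void main(String[] args) {
    CrawlContextSketch contexts = new CrawlContextSketch();
    // Property names below are example values only.
    contexts.create("news", Collections.singletonMap("http.agent.name", "example-bot"));
    contexts.update("news", "fetcher.threads.fetch", "20");
    System.out.println(contexts.list() + " " + contexts.read("news"));
  }
}
```

A single-configuration webapp would amount to this registry holding exactly one fixed entry, which is where the difference in implementation effort comes from.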

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Updated: (NUTCH-880) REST API for Nutch

Posted by Tolga Soyata <to...@rochester.edu>.

Hi there, how do I remove myself (for now) from the dev and user lists?
I could not figure it out.

On Thu, Oct 28, 2010 at 5:22 AM, Andrzej Bialecki (JIRA) <ji...@apache.org> wrote:

[jira] Updated: (NUTCH-880) REST API (and webapp) for Nutch

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-880:
------------------------------------

    Attachment: API-2.patch

An improved version, which actually works :) Configuration and job management are implemented, and there is also a unit test that exercises this API.

If there are no objections I'd like to commit this first version of the API, and continue improving it in other issues.

[jira] Updated: (NUTCH-880) REST API (and webapp) for Nutch

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-880:
------------------------------------

    Attachment: API.patch

Initial patch for discussion. This is a work in progress, so only some functionality is implemented, and even less than that is actually working ;)

I would appreciate a review and comments.

[jira] Updated: (NUTCH-880) REST API (and webapp) for Nutch

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-880:
------------------------------------

    Description: 
This issue is for discussing a REST-style API for accessing Nutch.

Here's an initial idea:

* I propose to use org.restlet for handling requests and returning JSON/XML/whatever responses.
* hook up all regular tools so that they can be driven via this API. This would have to be an async API, since all Nutch operations take a long time to execute. It follows that we also need to be able to list running operations, retrieve their current status, and possibly abort/cancel/stop/suspend/resume/...? This also means that we would potentially have to create & manage many threads in a servlet - AFAIK this is frowned upon by J2EE purists...
* package this in a webapp (that includes all deps, essentially nutch.job content), with the restlet servlet as an entry point.

Open issues:

* how to implement the reading of crawl results via this API
* should we manage only crawls that use a single configuration per webapp, or should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops on them? This would be nice, because it would allow managing several different crawls, with different configs, in a single webapp, but it complicates the implementation a lot.

  was:
This issue is for discussing a REST-style API for accessing Nutch.

Here's an initial idea:

* I propose to use org.restlet for handling JSON requests
* hook up all regular tools so that they can be driven via this API. This would have to be an async API, since all Nutch operations take a long time to execute. It follows that we also need to be able to list running operations, retrieve their current status, and possibly abort/cancel/stop/suspend/resume/...? This also means that we would potentially have to create & manage many threads in a servlet - AFAIK this is frowned upon by J2EE purists...
* package this in a webapp (that includes all deps, essentially nutch.job content), with the restlet servlet as an entry point.

Open issues:

* how to implement the reading of crawl results via this API
* should we manage only crawls that use a single configuration per webapp, or should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops on them? This would be nice, because it would allow managing several different crawls, with different configs, in a single webapp, but it complicates the implementation a lot.


[jira] Commented: (NUTCH-880) REST API for Nutch

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928909#action_12928909 ] 

Andrzej Bialecki  commented on NUTCH-880:
-----------------------------------------

Thanks - this issue is already fixed in NUTCH-932, to be committed soon.

[jira] Assigned: (NUTCH-880) REST API (and webapp) for Nutch

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  reassigned NUTCH-880:
---------------------------------------

    Assignee: Andrzej Bialecki 

[jira] Resolved: (NUTCH-880) REST API for Nutch

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  resolved NUTCH-880.
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 2.0

Committed in rev. 1028235. The webapp part of this issue is tracked now in NUTCH-929.

[jira] Commented: (NUTCH-880) REST API (and webapp) for Nutch

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913040#action_12913040 ] 

Doğacan Güney commented on NUTCH-880:
-------------------------------------

+1 from me. 

I think we can combine the approach you outlined in NUTCH-907 with this one. Instead of using confId-s to identify
different confs, we can use different crawl prefixes (or whatever we will call them) to identify different crawl sets (though
we still need a way to attach different conf-s to different crawl sets).

I think the API overall looks good. Maybe we can change all the Map<String, Object>s into proper classes though.
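
One possible shape for such a class is sketched below, purely as an illustration. JobInfoSketch and its three fields are assumptions and do not come from the patch; the toMap method shows that a map form can still be produced where a serializer expects one.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical typed replacement for a Map<String, Object> describing a job. */
final class JobInfoSketch {
  private final String id;
  private final String confId;
  private final String state;

  JobInfoSketch(String id, String confId, String state) {
    // Missing fields are rejected up front instead of surfacing later as absent map keys.
    this.id = Objects.requireNonNull(id, "id");
    this.confId = Objects.requireNonNull(confId, "confId");
    this.state = Objects.requireNonNull(state, "state");
  }

  String getId() { return id; }
  String getConfId() { return confId; }
  String getState() { return state; }

  /** Map view with a stable key order, e.g. for handing to a JSON serializer. */
  Map<String, Object> toMap() {
    Map<String, Object> m = new LinkedHashMap<>();
    m.put("id", id);
    m.put("confId", confId);
    m.put("state", state);
    return m;
  }

  public static void main(String[] args) {
    System.out.println(new JobInfoSketch("job-1", "default", "RUNNING").toMap());
  }
}
```

The main gain over a raw map is that typos in key names and wrong value types become compile-time errors.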

A minor question:

In JobManager.java:

+  public static enum JobType {INJECT, GENERATE, FETCH, PARSE, UPDATEDB, INDEX, CRAWL, CLASS};

What is "CLASS"  ?

Btw, Andrzej, I will be happy to help out with the implementation if you want.

[jira] Commented: (NUTCH-880) REST API for Nutch

Posted by "Alexis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928896#action_12928896 ] 

Alexis commented on NUTCH-880:
------------------------------

This revision introduced a bug in the nutch inject command. It now throws a NullPointerException.

Please take a look at:
http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/crawl/InjectorJob.java?annotate=1028235&pathrev=1028235

Make sure the first element in the array is not null:

{noformat}
Index: src/java/org/apache/nutch/crawl/InjectorJob.java
===================================================================
--- src/java/org/apache/nutch/crawl/InjectorJob.java    (revision 1031881)
+++ src/java/org/apache/nutch/crawl/InjectorJob.java    (working copy)
@@ -242,6 +242,7 @@
     job.setReducerClass(Reducer.class);
     job.setNumReduceTasks(0);
     job.waitForCompletion(true);
+    jobs[0] = job;

     job = new NutchJob(getConf(), "inject-p2 " + args[0]);
     StorageUtils.initMapperJob(job, FIELDS, String.class,
{noformat}
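
The failure mode can be reproduced in isolation. The sketch below is not the InjectorJob code: it assumes a two-slot array filled by two consecutive phases and a consumer that dereferences every slot, and it uses placeholder strings instead of real job objects.

```java
/** Hypothetical reduction of the bug: a two-phase run that forgets to record phase one. */
class JobsArraySketch {
  /** Dereferences every slot; throws NullPointerException if any slot was left null. */
  static int totalNameLength(String[] jobs) {
    int total = 0;
    for (String job : jobs) {
      total += job.length();
    }
    return total;
  }

  /** Runs two phases, reusing one local variable as the diff above does. */
  static String[] runPhases(boolean recordFirst) {
    String[] jobs = new String[2];
    String job = "phase-1";   // placeholder for the first job
    if (recordFirst) {
      jobs[0] = job;          // the assignment the proposed fix adds
    }
    job = "phase-2";          // the variable is reused for the second job
    jobs[1] = job;
    return jobs;
  }

  public static void main(String[] args) {
    System.out.println(totalNameLength(runPhases(true)));
    try {
      totalNameLength(runPhases(false));
    } catch (NullPointerException e) {
      System.out.println("first slot was never assigned");
    }
  }
}
```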


[jira] Commented: (NUTCH-880) REST API (and webapp) for Nutch

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913118#action_12913118 ] 

Andrzej Bialecki  commented on NUTCH-880:
-----------------------------------------

bq. I think we can combine the approach you outlined in NUTCH-907 with this one.

I'm not sure... they are really not the same things - you can execute many crawls with different seed lists, but still using the same Configuration.

bq. What is "CLASS" ?

It's the same as bin/nutch fully.qualified.class.name, only here I require that it implements NutchTool.
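
A minimal sketch of that idea, using only standard Java reflection: load a class by its fully qualified name and refuse it unless it implements a given interface. ToolSketch is a hypothetical stand-in for NutchTool, whose real methods are not shown in this thread, and EchoTool and ClassJobSketch are likewise invented for the example.

```java
/** Hypothetical stand-in for NutchTool; the real interface is not reproduced here. */
interface ToolSketch {
  String run(String[] args);
}

/** Trivial implementation used only to exercise the loader. */
class EchoTool implements ToolSketch {
  @Override
  public String run(String[] args) {
    return String.join(" ", args);
  }
}

class ClassJobSketch {
  /** Instantiates className via its no-arg constructor if it implements ToolSketch. */
  static ToolSketch load(String className) {
    try {
      Class<?> c = Class.forName(className);
      if (!ToolSketch.class.isAssignableFrom(c)) {
        throw new IllegalArgumentException(className + " does not implement ToolSketch");
      }
      return c.asSubclass(ToolSketch.class).getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      // Covers an unknown class, a missing no-arg constructor and failed instantiation.
      throw new IllegalArgumentException("Cannot instantiate " + className, e);
    }
  }

  public static void main(String[] args) {
    ToolSketch tool = load(EchoTool.class.getName());
    System.out.println(tool.run(new String[] {"hello", "tool"}));
  }
}
```

Checking the interface before instantiating means an arbitrary class name passed through the API is rejected without its constructor ever running.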

bq. Btw, Andrzej, I will be happy to help out with the implementation if you want.

By all means - I didn't have time so far to progress beyond this patch...

[jira] Updated: (NUTCH-880) REST API for Nutch

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-880:
------------------------------------

    Summary: REST API for Nutch  (was: REST API (and webapp) for Nutch)

The webapp part is tracked now in NUTCH-929.
