Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2010/11/04 22:06:44 UTC

[jira] Commented: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

    [ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928355#action_12928355 ] 

Andrzej Bialecki  commented on NUTCH-932:
-----------------------------------------

Examples (run against a db equivalent to the one in the attached db.formatted.gz):

{code}
$ curl -s 'http://localhost:8192/nutch/db?fields=url&end=http://www.freebsd.org/&start=http://www.egothor.org/'| ./json_pp
[
  {
    "url": "http://www.egothor.org/"
  }, 
  {
    "url": "http://www.freebsd.org/"
  }
]
{code}
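
A minimal Python sketch of building the same query URL as the curl call above. The helper name build_db_query is hypothetical and not part of the patch. Only the parameters shown in these examples (fields, start, end) are used. Treating missing start/end as "all records" is an assumption, as is the server accepting percent-encoded values.

```python
from urllib.parse import urlencode


def build_db_query(base, fields, start=None, end=None):
    """Build a db query URL (hypothetical helper, not part of the patch).

    base   -- endpoint, e.g. http://localhost:8192/nutch/db
    fields -- list of field names to return
    start  -- first key of the range (assumed optional)
    end    -- last key of the range (assumed optional)
    """
    params = [("fields", ",".join(fields))]
    if start is not None:
        params.append(("start", start))
    if end is not None:
        params.append(("end", end))
    # Keep ',', ':' and '/' unescaped so the URL reads like the curl examples;
    # any other reserved characters in the keys are still percent-encoded.
    return base + "?" + urlencode(params, safe=",:/")


if __name__ == "__main__":
    # Range query, as in the first curl example.
    print(build_db_query(
        "http://localhost:8192/nutch/db",
        ["url"],
        start="http://www.egothor.org/",
        end="http://www.freebsd.org/",
    ))
```

Setting start and end to the same key gives a single-record lookup, as in the second curl example.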

{code}
$ curl -s 'http://localhost:8192/nutch/db?fields=url,outlinks,markers,protocolStatus,parseStatus,contentType&start=http://www.getopt.org/&end=http://www.getopt.org/'| ./json_pp
[
  {
    "contentType": "text/html", 
    "url": "http://www.getopt.org/", 
    "markers": {
      "_updmrk_": "1288890451-1134865895"
    }, 
    "parseStatus": "success/ok (1/0), args=[]", 
    "protocolStatus": "SUCCESS, args=[]", 
    "outlinks": {
      "http://www.getopt.org/luke/": "Luke", 
      "http://www.getopt.org/ecimf/contrib/ONTO/REA": "REA Ontology page", 
      "http://www.getopt.org/CV.pdf": "CV here", 
      "http://www.getopt.org/utils/build/api": "API", 
      "http://svn.apache.org/viewvc/hadoop/hbase/trunk/src/java/org/apache/hadoop/hbase/util/JenkinsHash.java": "available here", 
      "http://www.getopt.org/murmur/MurmurHash.java": "MurmurHash.java", 
      "http://www.ebxml.org/": "ebXML / ebTWG", 
      "http://www.freebsd.org/": "FreeBSD", 
      "http://www.getopt.org/luke/webstart.html": "Launch with Java WebStart", 
      "http://www.freebsd.org/%7Epicobsd": "PicoBSD", 
      "http://home.comcast.net/~bretm/hash/6.html": "this discussion", 
      "http://protege.stanford.edu/": "Protege", 
      "http://jakarta.apache.org/lucene": "Lucene", 
      "http://www.getopt.org/ecimf/contrib/ONTO/ebxml": "ebXML Ontology", 
      "http://www.getopt.org/ecimf/": "here", 
      "http://www.isthe.com/chongo/tech/comp/fnv/": "his website", 
      "http://www.getopt.org/stempel/index.html": "Stempel", 
      "http://www.sigram.com/": "SIGRAM", 
      "http://www.egothor.org/": "Egothor", 
      "http://thinlet.sourceforge.net/": "Thinlet", 
      "http://www.getopt.org/utils/dist/utils-1.0.jar": "binary", 
      "http://www.ecimf.org/": "ECIMF"
    }
  }
]
{code}
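
A client could consume this output roughly as follows. This is only a sketch in Python: SAMPLE is a trimmed copy of the response above, external_outlinks is a hypothetical helper, and it assumes the result is always a JSON array of records in which "outlinks" maps target URL to anchor text, as shown here.

```python
import json
from urllib.parse import urlsplit

# Trimmed copy of the response shown above (only two outlinks kept).
SAMPLE = """
[
  {
    "contentType": "text/html",
    "url": "http://www.getopt.org/",
    "outlinks": {
      "http://www.getopt.org/luke/": "Luke",
      "http://www.freebsd.org/": "FreeBSD"
    }
  }
]
"""


def external_outlinks(response_text):
    """Map each record's url to the sorted outlinks pointing at other hosts.

    Hypothetical helper; records without an "outlinks" field (for example
    when it was not requested via fields=) yield an empty list.
    """
    result = {}
    for record in json.loads(response_text):
        host = urlsplit(record["url"]).netloc
        targets = record.get("outlinks", {})
        result[record["url"]] = sorted(
            target for target in targets
            if urlsplit(target).netloc != host
        )
    return result


if __name__ == "__main__":
    for url, targets in external_outlinks(SAMPLE).items():
        print(url, "->", targets)
```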


> Bulk REST API to retrieve crawl results as JSON
> -----------------------------------------------
>
>                 Key: NUTCH-932
>                 URL: https://issues.apache.org/jira/browse/NUTCH-932
>             Project: Nutch
>          Issue Type: New Feature
>          Components: REST_api
>    Affects Versions: 2.0
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>         Attachments: db.formatted.gz, NUTCH-932.patch, NUTCH-932.patch
>
>
> It would be useful to be able to retrieve results of a crawl as JSON. There are a few things that need to be discussed:
> * how to return bulk results using Restlet (WritableRepresentation subclass?)
> * what should be the format of results?
> I think it would make sense to provide retrieval of a single record (by primary key), of all records, and of records within a range. Incidentally, this matches the capabilities of the Gora Query class well :)
