Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2010/11/04 22:06:44 UTC
[jira] Commented: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928355#action_12928355 ]
Andrzej Bialecki commented on NUTCH-932:
-----------------------------------------
Examples (run against a db equivalent to the one in the attached db.formatted.gz):
{code}
$ curl -s 'http://localhost:8192/nutch/db?fields=url&end=http://www.freebsd.org/&start=http://www.egothor.org/'| ./json_pp
[
  {
    "url": "http://www.egothor.org/"
  },
  {
    "url": "http://www.freebsd.org/"
  }
]
{code}
{code}
$ curl -s 'http://localhost:8192/nutch/db?fields=url,outlinks,markers,protocolStatus,parseStatus,contentType&start=http://www.getopt.org/&end=http://www.getopt.org/'| ./json_pp
[
  {
    "contentType": "text/html",
    "url": "http://www.getopt.org/",
    "markers": {
      "_updmrk_": "1288890451-1134865895"
    },
    "parseStatus": "success/ok (1/0), args=[]",
    "protocolStatus": "SUCCESS, args=[]",
    "outlinks": {
      "http://www.getopt.org/luke/": "Luke",
      "http://www.getopt.org/ecimf/contrib/ONTO/REA": "REA Ontology page",
      "http://www.getopt.org/CV.pdf": "CV here",
      "http://www.getopt.org/utils/build/api": "API",
      "http://svn.apache.org/viewvc/hadoop/hbase/trunk/src/java/org/apache/hadoop/hbase/util/JenkinsHash.java": "available here",
      "http://www.getopt.org/murmur/MurmurHash.java": "MurmurHash.java",
      "http://www.ebxml.org/": "ebXML / ebTWG",
      "http://www.freebsd.org/": "FreeBSD",
      "http://www.getopt.org/luke/webstart.html": "Launch with Java WebStart",
      "http://www.freebsd.org/%7Epicobsd": "PicoBSD",
      "http://home.comcast.net/~bretm/hash/6.html": "this discussion",
      "http://protege.stanford.edu/": "Protege",
      "http://jakarta.apache.org/lucene": "Lucene",
      "http://www.getopt.org/ecimf/contrib/ONTO/ebxml": "ebXML Ontology",
      "http://www.getopt.org/ecimf/": "here",
      "http://www.isthe.com/chongo/tech/comp/fnv/": "his website",
      "http://www.getopt.org/stempel/index.html": "Stempel",
      "http://www.sigram.com/": "SIGRAM",
      "http://www.egothor.org/": "Egothor",
      "http://thinlet.sourceforge.net/": "Thinlet",
      "http://www.getopt.org/utils/dist/utils-1.0.jar": "binary",
      "http://www.ecimf.org/": "ECIMF"
    }
  }
]
{code}
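For scripted access, the same queries can be assembled programmatically. The following is only a rough client-side sketch in Python, not part of the patch: the base URL and the fields/start/end parameter names are copied from the curl examples above, while the helper names (build_db_url, single_record_url, fetch_records) are hypothetical. Setting start and end to the same key mirrors the second example, which returns a single record.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: base URL copied from the curl examples above.
BASE_URL = "http://localhost:8192/nutch/db"


def build_db_url(fields, start=None, end=None, base_url=BASE_URL):
    """Build a query URL from the fields/start/end parameters seen above."""
    params = {"fields": ",".join(fields)}
    if start is not None:
        params["start"] = start
    if end is not None:
        params["end"] = end
    # Leave ':', '/' and ',' unescaped so the result matches the curl examples.
    return base_url + "?" + urlencode(params, safe=":/,")


def single_record_url(key, fields):
    """Hypothetical helper: one record, using start == end as in the second example."""
    return build_db_url(fields, start=key, end=key)


def fetch_records(url):
    """Fetch the URL and decode the JSON array (requires a running server)."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running as in the examples, fetch_records(single_record_url("http://www.getopt.org/", ["url", "outlinks"])) should return a list of dicts shaped like the JSON above.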
> Bulk REST API to retrieve crawl results as JSON
> -----------------------------------------------
>
> Key: NUTCH-932
> URL: https://issues.apache.org/jira/browse/NUTCH-932
> Project: Nutch
> Issue Type: New Feature
> Components: REST_api
> Affects Versions: 2.0
> Reporter: Andrzej Bialecki
> Assignee: Andrzej Bialecki
> Attachments: db.formatted.gz, NUTCH-932.patch, NUTCH-932.patch
>
>
> It would be useful to be able to retrieve results of a crawl as JSON. There are a few things that need to be discussed:
> * how to return bulk results using Restlet (WritableRepresentation subclass?)
> * what should be the format of results?
> I think it would make sense to support retrieval of a single record (by primary key), of all records, and of records within a range. Incidentally, this matches the capabilities of the Gora Query class well :)