Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2010/10/29 19:01:23 UTC
[jira] Created: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
Bulk REST API to retrieve crawl results as JSON
-----------------------------------------------
Key: NUTCH-932
URL: https://issues.apache.org/jira/browse/NUTCH-932
Project: Nutch
Issue Type: New Feature
Components: REST_api
Affects Versions: 2.0
Reporter: Andrzej Bialecki
Assignee: Andrzej Bialecki
It would be useful to be able to retrieve results of a crawl as JSON. There are a few things that need to be discussed:
* how to return bulk results using Restlet (WritableRepresentation subclass?)
* what should be the format of results?
I think it would make sense to provide single-record retrieval (by primary key), retrieval of all records, and retrieval of records within a range. Incidentally, this matches the capabilities of the Gora Query class well :)
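The three retrieval modes could all be expressed as one query with optional start and end keys. Below is a minimal Python sketch of a client-side URL builder. It assumes the `/nutch/db` endpoint and the `fields`, `start` and `end` parameters that appear in the curl examples later in this thread. The helper name `build_db_url` is hypothetical, and the idea that a request without parameters returns all records is an assumption, not something the thread confirms.

```python
from typing import Optional, Sequence
from urllib.parse import urlencode

# Base URL as used in the curl examples in this thread; adjust for your setup.
BASE_URL = "http://localhost:8192/nutch/db"


def build_db_url(fields: Optional[Sequence[str]] = None,
                 start: Optional[str] = None,
                 end: Optional[str] = None,
                 base_url: str = BASE_URL) -> str:
    """Build a query URL (hypothetical helper, not part of Nutch)."""
    params = {}
    if fields:
        params["fields"] = ",".join(fields)
    if start is not None:
        params["start"] = start
    if end is not None:
        params["end"] = end
    if not params:
        # Assumption: no parameters means "all records, all fields".
        return base_url
    # Leave ':', '/' and ',' unescaped so keys stay readable, as in the curl examples.
    return base_url + "?" + urlencode(params, safe=":/,")


# All records (assumed behaviour of the bare endpoint).
all_records = build_db_url()

# Single record: start and end are the same key, as in the second curl example.
key = "http://www.getopt.org/"
single_record = build_db_url(fields=["url"], start=key, end=key)

# Records within a range of keys.
key_range = build_db_url(fields=["url"],
                         start="http://www.egothor.org/",
                         end="http://www.freebsd.org/")

print(all_records)
print(single_record)
print(key_range)
```

Treating a single record as a range whose start and end keys are identical keeps the client to one code path for all three modes.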
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki resolved NUTCH-932.
-------------------------------------
Resolution: Fixed
Fix Version/s: 2.0
Committed in rev. 1039014.
> Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932-3.patch, NUTCH-932-4.patch, NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch
[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki updated NUTCH-932:
------------------------------------
Attachment: NUTCH-932-4.patch
Final version of the patch.
> Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932-3.patch, NUTCH-932-4.patch, NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch
[jira] Commented: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928355#action_12928355 ]
Andrzej Bialecki commented on NUTCH-932:
-----------------------------------------
Examples (with the db equivalent to the one in db.formatted.gz):
{code}
$ curl -s 'http://localhost:8192/nutch/db?fields=url&end=http://www.freebsd.org/&start=http://www.egothor.org/' | ./json_pp
[
   {
      "url": "http://www.egothor.org/"
   },
   {
      "url": "http://www.freebsd.org/"
   }
]
{code}
{code}
$ curl -s 'http://localhost:8192/nutch/db?fields=url,outlinks,markers,protocolStatus,parseStatus,contentType&start=http://www.getopt.org/&end=http://www.getopt.org/' | ./json_pp
[
   {
      "contentType": "text/html",
      "url": "http://www.getopt.org/",
      "markers": {
         "_updmrk_": "1288890451-1134865895"
      },
      "parseStatus": "success/ok (1/0), args=[]",
      "protocolStatus": "SUCCESS, args=[]",
      "outlinks": {
         "http://www.getopt.org/luke/": "Luke",
         "http://www.getopt.org/ecimf/contrib/ONTO/REA": "REA Ontology page",
         "http://www.getopt.org/CV.pdf": "CV here",
         "http://www.getopt.org/utils/build/api": "API",
         "http://svn.apache.org/viewvc/hadoop/hbase/trunk/src/java/org/apache/hadoop/hbase/util/JenkinsHash.java": "available here",
         "http://www.getopt.org/murmur/MurmurHash.java": "MurmurHash.java",
         "http://www.ebxml.org/": "ebXML / ebTWG",
         "http://www.freebsd.org/": "FreeBSD",
         "http://www.getopt.org/luke/webstart.html": "Launch with Java WebStart",
         "http://www.freebsd.org/%7Epicobsd": "PicoBSD",
         "http://home.comcast.net/~bretm/hash/6.html": "this discussion",
         "http://protege.stanford.edu/": "Protege",
         "http://jakarta.apache.org/lucene": "Lucene",
         "http://www.getopt.org/ecimf/contrib/ONTO/ebxml": "ebXML Ontology",
         "http://www.getopt.org/ecimf/": "here",
         "http://www.isthe.com/chongo/tech/comp/fnv/": "his website",
         "http://www.getopt.org/stempel/index.html": "Stempel",
         "http://www.sigram.com/": "SIGRAM",
         "http://www.egothor.org/": "Egothor",
         "http://thinlet.sourceforge.net/": "Thinlet",
         "http://www.getopt.org/utils/dist/utils-1.0.jar": "binary",
         "http://www.ecimf.org/": "ECIMF"
      }
   }
]
{code}
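On the client side, a response like the ones above can be read with any JSON parser. Here is a minimal Python sketch that uses a shortened copy of the second response as inline sample data. The helper `outlink_targets` is hypothetical. It relies only on what the examples show: the response is an array of records, and `outlinks` appears to map target URLs to anchor text.

```python
import json

# Shortened copy of the second response above, used as sample data.
SAMPLE_RESPONSE = """
[
   {
      "contentType": "text/html",
      "url": "http://www.getopt.org/",
      "outlinks": {
         "http://www.getopt.org/luke/": "Luke",
         "http://www.freebsd.org/": "FreeBSD"
      }
   }
]
"""


def outlink_targets(response_text):
    """Map each record's URL to a sorted list of its outlink targets.

    Hypothetical helper: records without an "outlinks" field (for example,
    when it was not requested via "fields") get an empty list.
    """
    result = {}
    for record in json.loads(response_text):
        outlinks = record.get("outlinks", {})
        result[record.get("url")] = sorted(outlinks)
    return result


targets = outlink_targets(SAMPLE_RESPONSE)
for url, links in targets.items():
    print(url, "->", links)
```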
> Attachments: db.formatted.gz, NUTCH-932.patch, NUTCH-932.patch
[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki updated NUTCH-932:
------------------------------------
Attachment: NUTCH-932-3.patch
NutchTool is an abstract class in this patch. This actually minimizes the amount of code throughout, though paradoxically the patch file is larger than before...
> Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932-3.patch, NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch
[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki updated NUTCH-932:
------------------------------------
Attachment: NUTCH-932.patch
This patch adds bulk retrieval of crawl results. This is still very rough, e.g. there's no way to select crawlId or limit the fields... but it returns proper JSON.
This patch also includes other enhancements and bugfixes - with this patch I was able to perform a complete crawl cycle via REST.
> Attachments: NUTCH-932.patch
[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki updated NUTCH-932:
------------------------------------
Attachment: NUTCH-932.patch
Updated patch. This changes the NutchTool API to allow for execution steps that are not MapReduce jobs, and to allow arguments to be passed in arbitrary order (a side effect of using the Restlet API).
As a proof of concept I reimplemented the Crawler class (a one-shot crawler). If there are no objections I'll commit this shortly.
> Attachments: db.formatted.gz, NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch
[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki updated NUTCH-932:
------------------------------------
Attachment: NUTCH-932-2.patch
This patch simplifies the NutchTool API and reduces changes to implementations of NutchTool. I'd like to commit this patch soon.
> Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch
[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki updated NUTCH-932:
------------------------------------
Attachment: db.formatted.gz
Example DB content (this was passed through a JSON pretty-printer, otherwise it's just one giant line...).
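For reference, the same kind of pretty-printing can be done without an external tool. Here is a minimal Python sketch, assuming the raw response is a single-line JSON array. The sample string mirrors the first curl example in this thread and is not taken from db.formatted.gz. The function name `pretty` is illustrative.

```python
import json


def pretty(raw: str, indent: int = 3) -> str:
    """Re-serialize a JSON document with indentation for readability."""
    return json.dumps(json.loads(raw), indent=indent)


# Single-line input, similar in shape to what the endpoint returns.
raw = '[{"url": "http://www.egothor.org/"}, {"url": "http://www.freebsd.org/"}]'

formatted = pretty(raw)
print(formatted)
```

Parsing and re-serializing changes only whitespace, so the formatted text still decodes to the same records.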
> Attachments: db.formatted.gz, NUTCH-932.patch
[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki updated NUTCH-932:
------------------------------------
Attachment: NUTCH-932.patch
Updated patch - this now recognizes URL parameters such as fields, start/end keys, batch and crawl id.
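To illustrate how such parameters might be interpreted, here is a minimal Python sketch that turns a query string into a small query description. The `fields`, `start` and `end` names match the curl examples in this thread. The `batch` and `crawl` parameter names, the `DbQuery` class and the `parse_db_query` function are assumptions made for illustration, since the thread does not spell them out. The patch itself is not Python and may handle these parameters differently.

```python
from dataclasses import dataclass
from typing import Optional, Tuple
from urllib.parse import parse_qs


@dataclass(frozen=True)
class DbQuery:
    """Hypothetical description of a bulk retrieval request."""
    fields: Tuple[str, ...] = ()   # empty tuple: no field restriction
    start: Optional[str] = None    # start key of the range, if any
    end: Optional[str] = None      # end key of the range, if any
    batch: Optional[str] = None    # assumed parameter name for the batch id
    crawl: Optional[str] = None    # assumed parameter name for the crawl id


def parse_db_query(query_string: str) -> DbQuery:
    """Parse a URL query string into a DbQuery (illustrative only)."""
    params = parse_qs(query_string)

    def first(name: str) -> Optional[str]:
        # parse_qs returns a list per key; keep the first value only.
        values = params.get(name)
        return values[0] if values else None

    raw_fields = first("fields")
    fields = tuple(f for f in raw_fields.split(",") if f) if raw_fields else ()
    return DbQuery(fields=fields,
                   start=first("start"),
                   end=first("end"),
                   batch=first("batch"),
                   crawl=first("crawl"))


query = parse_db_query(
    "fields=url,outlinks&start=http://www.getopt.org/&end=http://www.getopt.org/")
print(query)
```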
> Attachments: db.formatted.gz, NUTCH-932.patch, NUTCH-932.patch