Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2010/10/29 19:01:23 UTC

[jira] Created: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

Bulk REST API to retrieve crawl results as JSON
-----------------------------------------------

                 Key: NUTCH-932
                 URL: https://issues.apache.org/jira/browse/NUTCH-932
             Project: Nutch
          Issue Type: New Feature
          Components: REST_api
    Affects Versions: 2.0
            Reporter: Andrzej Bialecki 
            Assignee: Andrzej Bialecki 


It would be useful to be able to retrieve results of a crawl as JSON. There are a few things that need to be discussed:

* how to return bulk results using Restlet (WritableRepresentation subclass?)

* what should be the format of results?
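
On the first question, one way to picture "bulk" output is to write the JSON array one record at a time instead of building the whole result in memory. The sketch below is only an illustration in Python, not Restlet code: `iter_json_array`, `write_json_array` and `fake_scan` are hypothetical names, and the record shape is assumed.

```python
import io
import json
from typing import Iterable, Iterator, TextIO


def iter_json_array(records: Iterable[dict]) -> Iterator[str]:
    """Yield the text chunks of a JSON array, one record at a time."""
    yield "["
    first = True
    for record in records:
        if not first:
            yield ","  # separator goes before every record except the first
        yield json.dumps(record, sort_keys=True)
        first = False
    yield "]"


def write_json_array(records: Iterable[dict], out: TextIO) -> None:
    """Write the records to `out` without holding them all in memory."""
    for chunk in iter_json_array(records):
        out.write(chunk)


def fake_scan() -> Iterator[dict]:
    """Hypothetical stand-in for a scan over the crawl database."""
    yield {"url": "http://www.egothor.org/"}
    yield {"url": "http://www.freebsd.org/"}


buf = io.StringIO()
write_json_array(fake_scan(), buf)
```

Because the writer only ever sees one record at a time, the same loop would work against a socket or a response stream; an empty scan still produces a valid `[]`.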

I think it would make sense to provide single-record retrieval (by primary key), retrieval of all records, and retrieval of records within a key range. Incidentally, this matches the capabilities of the Gora Query class well :)
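
The three access patterns can be sketched against a toy key-ordered store. This is not Gora's API: `SortedStore`, `get` and `scan` are hypothetical names, and treating both range bounds as inclusive is an assumption.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Optional


class SortedStore:
    """Toy in-memory stand-in for a key-ordered store (hypothetical)."""

    def __init__(self, rows: Dict[str, dict]) -> None:
        self._keys = sorted(rows)
        self._rows = dict(rows)

    def get(self, key: str) -> Optional[dict]:
        """Single-record retrieval by primary key."""
        return self._rows.get(key)

    def scan(self, start: Optional[str] = None,
             end: Optional[str] = None) -> List[dict]:
        """All records (no bounds) or records in [start, end], inclusive."""
        lo = 0 if start is None else bisect_left(self._keys, start)
        hi = len(self._keys) if end is None else bisect_right(self._keys, end)
        return [self._rows[k] for k in self._keys[lo:hi]]
```

With a store keyed by URL, `store.get(key)` covers the single-record case, `store.scan()` returns everything, and `store.scan(start, end)` returns the slice of keys between the two bounds.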

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  resolved NUTCH-932.
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 2.0

Committed in rev. 1039014.

>         Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932-3.patch, NUTCH-932-4.patch, NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch



[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-932:
------------------------------------

    Attachment: NUTCH-932-4.patch

Final version of the patch.

>         Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932-3.patch, NUTCH-932-4.patch, NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch



[jira] Commented: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928355#action_12928355 ] 

Andrzej Bialecki  commented on NUTCH-932:
-----------------------------------------

Examples (with the db equivalent to the one in db.formatted.gz):

{code}
$ curl -s 'http://localhost:8192/nutch/db?fields=url&end=http://www.freebsd.org/&start=http://www.egothor.org/'| ./json_pp
[
  {
    "url": "http://www.egothor.org/"
  }, 
  {
    "url": "http://www.freebsd.org/"
  }
]
{code}

{code}
$ curl -s 'http://localhost:8192/nutch/db?fields=url,outlinks,markers,protocolStatus,parseStatus,contentType&start=http://www.getopt.org/&end=http://www.getopt.org/'| ./json_pp
[
  {
    "contentType": "text/html", 
    "url": "http://www.getopt.org/", 
    "markers": {
      "_updmrk_": "1288890451-1134865895"
    }, 
    "parseStatus": "success/ok (1/0), args=[]", 
    "protocolStatus": "SUCCESS, args=[]", 
    "outlinks": {
      "http://www.getopt.org/luke/": "Luke", 
      "http://www.getopt.org/ecimf/contrib/ONTO/REA": "REA Ontology page", 
      "http://www.getopt.org/CV.pdf": "CV here", 
      "http://www.getopt.org/utils/build/api": "API", 
      "http://svn.apache.org/viewvc/hadoop/hbase/trunk/src/java/org/apache/hadoop/hbase/util/JenkinsHash.java": "available here", 
      "http://www.getopt.org/murmur/MurmurHash.java": "MurmurHash.java", 
      "http://www.ebxml.org/": "ebXML / ebTWG", 
      "http://www.freebsd.org/": "FreeBSD", 
      "http://www.getopt.org/luke/webstart.html": "Launch with Java WebStart", 
      "http://www.freebsd.org/%7Epicobsd": "PicoBSD", 
      "http://home.comcast.net/~bretm/hash/6.html": "this discussion", 
      "http://protege.stanford.edu/": "Protege", 
      "http://jakarta.apache.org/lucene": "Lucene", 
      "http://www.getopt.org/ecimf/contrib/ONTO/ebxml": "ebXML Ontology", 
      "http://www.getopt.org/ecimf/": "here", 
      "http://www.isthe.com/chongo/tech/comp/fnv/": "his website", 
      "http://www.getopt.org/stempel/index.html": "Stempel", 
      "http://www.sigram.com/": "SIGRAM", 
      "http://www.egothor.org/": "Egothor", 
      "http://thinlet.sourceforge.net/": "Thinlet", 
      "http://www.getopt.org/utils/dist/utils-1.0.jar": "binary", 
      "http://www.ecimf.org/": "ECIMF"
    }
  }
]
{code}
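
The same requests can be issued from a small client. The following is a minimal sketch in Python, assuming the endpoint and the `fields`, `start` and `end` parameters behave as in the curl examples above; the helper names are hypothetical, and UTF-8 response encoding is an assumption.

```python
import json
from typing import List, Optional
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8192/nutch/db"  # endpoint taken from the curl examples


def build_db_url(fields: List[str], start: Optional[str] = None,
                 end: Optional[str] = None, base: str = BASE_URL) -> str:
    """Build a query URL with a comma-separated field list and optional key range."""
    params = {"fields": ",".join(fields)}
    if start is not None:
        params["start"] = start
    if end is not None:
        params["end"] = end
    # Leave ':', '/' and ',' unescaped so the URL matches the curl form.
    return base + "?" + urlencode(params, safe=":/,")


def fetch_records(url: str) -> List[dict]:
    """Fetch and decode the JSON array (assumes a UTF-8 response body)."""
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def outlink_targets(record: dict) -> List[str]:
    """Return the outlink URLs of one record, sorted; empty if none were requested."""
    return sorted(record.get("outlinks", {}))
```

For example, `fetch_records(build_db_url(["url", "outlinks"], start=key, end=key))` would request a single page, and `outlink_targets` would then list the targets from its `outlinks` map.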


>         Attachments: db.formatted.gz, NUTCH-932.patch, NUTCH-932.patch



[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-932:
------------------------------------

    Attachment: NUTCH-932-3.patch

NutchTool is an abstract class in this patch. This actually minimizes the amount of code throughout, though paradoxically the patch file is larger than before...

>         Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932-3.patch, NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch



[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-932:
------------------------------------

    Attachment: NUTCH-932.patch

This patch adds bulk retrieval of crawl results. This is still very rough, e.g. there's no way to select crawlId or limit the fields... but it returns proper JSON.

This patch also includes other enhancements and bugfixes - with this patch I was able to perform a complete crawl cycle via REST.

>         Attachments: NUTCH-932.patch



[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-932:
------------------------------------

    Attachment: NUTCH-932.patch

Updated patch. This changes the NutchTool API to allow for execution steps that are not mapreduce jobs, and to pass arguments in arbitrary order, which was a side-effect of the Restlet API.
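
To illustrate why named arguments make the order irrelevant, here is a rough Python analogue. It is not the NutchTool API from the patch: `Tool`, `SeedCountTool` and the `depth` and `seeds` argument names are all hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class Tool(ABC):
    """Hypothetical tool interface: arguments arrive as a name -> value map."""

    @abstractmethod
    def run(self, args: Dict[str, Any]) -> Dict[str, Any]:
        """Execute the step and return a map of results."""


class SeedCountTool(Tool):
    """A step that does plain in-process work; no job is submitted anywhere."""

    def run(self, args: Dict[str, Any]) -> Dict[str, Any]:
        # Values may arrive as strings (e.g. from a request), so coerce them.
        depth = int(args.get("depth", 1))
        seeds = list(args.get("seeds", []))
        return {"depth": depth, "seed_count": len(seeds)}
```

Since each argument is looked up by name, a caller that assembles the map from request parameters does not need to know any positional order.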

As a proof of concept I reimplemented the Crawler class (a one-shot crawler). If there are no objections I'll commit this shortly.

>         Attachments: db.formatted.gz, NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch



[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-932:
------------------------------------

    Attachment: NUTCH-932-2.patch

This patch simplifies the NutchTool API and reduces changes to implementations of NutchTool. I'd like to commit this patch soon.

>         Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch



[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-932:
------------------------------------

    Attachment: db.formatted.gz

Example DB content (this was passed through a JSON pretty-printer, otherwise it's just one giant line...).

>         Attachments: db.formatted.gz, NUTCH-932.patch



[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-932:
------------------------------------

    Attachment: NUTCH-932.patch

Updated patch: this now recognizes URL parameters such as fields, start/end keys, batch and crawl id.
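
As a rough picture of what recognizing these parameters involves, the Python sketch below parses a request URL into a small query object. `DbQuery` and `parse_db_query` are hypothetical names; `fields`, `start` and `end` follow the examples in this thread, while the parameter names `batch` and `crawl` are assumptions, since the thread does not spell them out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlsplit


@dataclass(frozen=True)
class DbQuery:
    fields: Tuple[str, ...] = ()
    start: Optional[str] = None
    end: Optional[str] = None
    batch: Optional[str] = None  # assumed parameter name
    crawl: Optional[str] = None  # assumed parameter name


def parse_db_query(url: str) -> DbQuery:
    """Extract the recognized parameters; unknown ones are ignored."""
    qs = parse_qs(urlsplit(url).query)

    def one(name: str) -> Optional[str]:
        values = qs.get(name)
        return values[0] if values else None

    raw_fields = one("fields")
    fields = tuple(f for f in raw_fields.split(",") if f) if raw_fields else ()
    return DbQuery(fields, one("start"), one("end"), one("batch"), one("crawl"))
```

Parameters are looked up by name, so `start` and `end` may appear in either order in the URL, and a request with no parameters yields an unrestricted query.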

>         Attachments: db.formatted.gz, NUTCH-932.patch, NUTCH-932.patch
