Posted to user@nutch.apache.org by Jessica Glover <gl...@gmail.com> on 2015/06/12 16:10:01 UTC

REST API for crawling

Hello. I am trying to test out the 2.3 REST API using curl, but I'm having
trouble with the commands. I found out what arguments to use for the inject
job from searching the archives, and that was successful, but when I try
generate with no args, it fails:

    {
        "args": {},
        "confId": "default",
        "crawlId": "crawl-01",
        "id": "crawl-01-default-GENERATE-94689123",
        "msg": "ERROR: java.lang.RuntimeException: job failed: name=[crawl-01]generate: null, jobid=job_local473690964_0003",
        "result": null,
        "state": "FAILED",
        "type": "GENERATE"
    },


and when I try it with a topN argument, I get the same error about casting
an int to a long that a...@21decades.com reported.

Can anyone provide some guidance on how to figure out what arguments to
use? Right now I'm guessing based on the source code in GeneratorJob.java.

Does anyone know of any example code out there that uses the REST API?


Thanks,

Jessica

Re: REST API for crawling

Posted by Alex <al...@21decades.com>.
Hi Jessica,

Try applying this patch to fix the ClassPathException for NutchGora: https://issues.apache.org/jira/browse/NUTCH-2019
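
The int-to-long cast error mentioned in the question is easy to reproduce in plain Java. The sketch below is not Nutch code. It assumes, hypothetically, that a job reads topN from a generic argument map and casts it straight to Long while the JSON layer has supplied an Integer. The names TopNCastSketch, directCastFails, and asLong are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class TopNCastSketch {
    // Hypothetical stand-in for a job that blindly casts its argument to Long.
    // Returns true when the cast throws ClassCastException.
    static boolean directCastFails(Object value) {
        try {
            Long unused = (Long) value; // fails at runtime if value is an Integer
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    // Tolerant alternative: accept any numeric type and widen it to long.
    static long asLong(Object value) {
        if (value instanceof Number) {
            return ((Number) value).longValue();
        }
        throw new IllegalArgumentException("not a number: " + value);
    }

    public static void main(String[] args) {
        Map<String, Object> jobArgs = new HashMap<>();
        jobArgs.put("topN", 50); // autoboxed to Integer, as a JSON parser might do

        System.out.println(directCastFails(jobArgs.get("topN"))); // prints "true"
        System.out.println(asLong(jobArgs.get("topN")));          // prints "50"
    }
}
```

Going through Number.longValue() works for Integer and Long values alike, which is why it is a common way to read numeric arguments out of an untyped map.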

Some of the arguments can be found in org.apache.nutch.metadata.Nutch:
  /** Batch id to select. */
  public static final String ARG_BATCH = "batch";
  /** Crawl id to use. */
  public static final String ARG_CRAWL = "crawl";
  /** Resume previously aborted op. */
  public static final String ARG_RESUME = "resume";
  /** Force processing even if there are locks or inconsistencies. */
  public static final String ARG_FORCE = "force";
  /** Sort statistics. */
  public static final String ARG_SORT = "sort";
  /** Solr URL. */
  public static final String ARG_SOLR = "solr";
  /** Number of fetcher threads (per map task). */
  public static final String ARG_THREADS = "threads";
  /** Number of fetcher tasks. */
  public static final String ARG_NUMTASKS = "numTasks";
  /** Generate topN scoring URLs. */
  public static final String ARG_TOPN = "topN";
  /** The notion of current time. */
  public static final String ARG_CURTIME = "curTime";
  /** Apply URLFilters. */
  public static final String ARG_FILTER = "filter";
  /** Apply URLNormalizers. */
  public static final String ARG_NORMALIZE = "normalize";
  /** Whitespace-separated list of seed URLs. */
  public static final String ARG_SEEDLIST = "seed";
  /** a path to a directory containing a list of seed URLs. */
  public static final String ARG_SEEDDIR = "seedDir";
  /** Class to run as a NutchTool. */
  public static final String ARG_CLASS = "class";
  /** Depth (number of cycles) of a crawl. */
  public static final String ARG_DEPTH = "depth";
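
As a rough illustration of where these names would go, here is a minimal sketch that builds a JSON body for a generate job. The field names (args, confId, crawlId, type) mirror the job status quoted in the question. That the job-creation request uses the same shape is an assumption, as are the value types (a number for topN, a boolean for filter). GenerateRequestSketch and its methods are hypothetical helpers, not part of Nutch, and nothing is sent over the network.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class GenerateRequestSketch {
    // Key names copied from the constants listed above.
    static final String ARG_TOPN = "topN";
    static final String ARG_FILTER = "filter";

    // Renders a flat map of string, number, or boolean values as a JSON object.
    // No string escaping is done; that is enough for the simple values used here.
    static String toJson(Map<String, Object> map) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, Object> e : map.entrySet()) {
            if (!first) {
                sb.append(",");
            }
            first = false;
            sb.append('"').append(e.getKey()).append("\":");
            Object v = e.getValue();
            if (v instanceof Number || v instanceof Boolean) {
                sb.append(v);
            } else {
                sb.append('"').append(v).append('"');
            }
        }
        return sb.append("}").toString();
    }

    // Assumed request shape, modelled on the job status shown in the question.
    static String requestBody(String crawlId, String confId, String type,
                              Map<String, Object> args) {
        return "{\"args\":" + toJson(args)
            + ",\"confId\":\"" + confId + "\""
            + ",\"crawlId\":\"" + crawlId + "\""
            + ",\"type\":\"" + type + "\"}";
    }

    public static void main(String[] unused) {
        Map<String, Object> args = new LinkedHashMap<>();
        args.put(ARG_TOPN, 50);
        args.put(ARG_FILTER, true);
        System.out.println(requestBody("crawl-01", "default", "GENERATE", args));
    }
}
```

This only shows the shape of the payload. The question reports a cast error when topN is passed, so a numeric topN like this one may still fail on an unpatched server.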

Regards,
Alex


Re: REST API for crawling

Posted by Jessica Glover <gl...@gmail.com>.
Oh my gosh, thank you!

On Fri, Jun 12, 2015 at 10:23 AM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Thank you Dzmitry!
>
> All, FYI too - Nutch 1.x has an actively developed REST API. We
> are targeting it for integration as a mechanism for both the Nutch admin GUI
> (GSoC Project last summer) and for Memex Explorer (
> http://github.com/memex-explorer/memex-explorer). We are also building a
> Nutch Python API that will use this.
>
> Cheers,
> Chris

RE: REST API for crawling

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Thank you Dzmitry!

All, FYI too - Nutch 1.x has an actively developed REST API. We
are targeting it for integration as a mechanism for both the Nutch admin GUI
(GSoC Project last summer) and for Memex Explorer
(http://github.com/memex-explorer/memex-explorer). We are also building a
Nutch Python API that will use this.

Cheers,
Chris

________________________________________
From: d.zenin [breedish@gmail.com]
Sent: Friday, June 12, 2015 7:18 AM
To: user@nutch.apache.org
Subject: Re: REST API for crawling

Hi Jessica,

Two months ago I prepared a document that describes how to use the Nutch REST
API, with examples. I hope it helps you:
https://docs.google.com/document/d/1OGg22ATohapP2ycewIaTcUnENc2FeyYzni0ED_Jjxz8/edit?usp=sharing


Best Regards,
Dzmitry


Re: REST API for crawling

Posted by "d.zenin" <br...@gmail.com>.
Hi Jessica,

Two months ago I prepared a document that describes how to use the Nutch REST
API, with examples. I hope it helps you:
https://docs.google.com/document/d/1OGg22ATohapP2ycewIaTcUnENc2FeyYzni0ED_Jjxz8/edit?usp=sharing


Best Regards,
Dzmitry
