Posted to user@nutch.apache.org by WebDawg <we...@gmail.com> on 2016/10/10 18:01:03 UTC

Nutch 2.3.1

Hello,

I have the webapp and nutchserver running successfully, and I would like
to know more about the API and whether it is functional.

I am trying to hack into it and wonder what the relationship is between
the different config URLs and configs.

Any help on this?  I would like to figure out how this works.  Does it
reference all the files in the conf dir?  If I do a crawl is it the
same as executing a crawl via command line?

Re: Nutch 2.3.1

Posted by Néstor <ro...@gmail.com>.
Can you send it to me also?
Thanks,

Néstor

On Oct 10, 2016 9:33 PM, "MrSrivastavaRK ." <sr...@gmail.com> wrote:
>
> Hi,
> I have successfully indexed content in Elasticsearch using Nutch 1.12 REST
> API. I can send you api details, If you want for reference.
>
> Regards
> Rajeev
>
> On Oct 10, 2016 11:31 PM, "WebDawg" <we...@gmail.com> wrote:
>
> > Hello,
> >
> > I successfully have webapp and nutchserver running and I would like to
> > know more about the API and if it is functional.
> >
> > I am trying to hack into it and wonder what the relationship between
> > the different config urls and configs.
> >
> > Any help on this?  I would like to figure out how this works.  Does it
> > reference all the files in the conf dir?  If I do a crawl is it the
> > same as executing a crawl via command line?
> >

Re: Nutch 2.3.1

Posted by "MrSrivastavaRK ." <sr...@gmail.com>.
Hi,

I did a POC for indexing content in ES using Nutch 1.12. See the REST API
details below. I executed bin/nutch startserver --port 9090 in local mode.
By default, Nutch creates a folder in the bin directory for each crawl
request, based on the crawlId parameter.

POST /config/create
{
    "configId": "ereader",
    "force": "true",
    "params": {
        "http.agent.name": "elasticnutchrest",
        "http.robots.agents": "elasticnutchrest",
        "http.timeout": "1000000",
        "plugin.includes": "protocol-httpclient|urlfilter-regex|parse-(text|tika|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic",
        "index.metadata": "title,content",
        "index.parse.md": "metatag.title,metatag.content",
        "elastic.host": "localhost"
    }
}
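
As a rough sketch, assuming the server started by the startserver command
above is listening on localhost:9090, a request like this could be built
from Python's standard library. The helper name build_post is hypothetical,
only a subset of the params is shown, and the request is built but not sent.

```python
import json
import urllib.request

# Assumption: the server was started with `bin/nutch startserver --port 9090`
# on the same machine, so it is reachable at localhost:9090.
BASE_URL = "http://localhost:9090"


def build_post(path, payload, base_url=BASE_URL):
    """Build (but do not send) a JSON POST request for the REST server."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/" + path.lstrip("/"),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# A shortened version of the config body from the message.
config_request = build_post("/config/create", {
    "configId": "ereader",
    "force": "true",
    "params": {
        "http.agent.name": "elasticnutchrest",
        "elastic.host": "localhost",
    },
})

# To actually send it (needs a running server):
# with urllib.request.urlopen(config_request) as resp:
#     print(resp.status, resp.read().decode("utf-8"))
```

Keeping the build step separate from the send step makes it easy to print
and inspect the body before anything reaches the server.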

POST /seed/create
{
    "id": "101",
    "name": "ereader",
    "seedUrls": [
        {
            "id": "1",
            "url": "https://www.example.com"
        }
    ]
}
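
For more than one seed URL, the same body could be generated rather than
typed by hand. This is only a sketch: the function name seed_payload is
hypothetical, and numbering the inner ids from 1 is an assumption that simply
follows the pattern of the single entry shown.

```python
import json


def seed_payload(list_id, name, urls):
    """Return a seed/create body shaped like the one in the message."""
    return {
        "id": str(list_id),
        "name": name,
        "seedUrls": [
            # Assumption: inner ids are consecutive strings starting at "1".
            {"id": str(i), "url": url}
            for i, url in enumerate(urls, start=1)
        ],
    }


payload = seed_payload(101, "ereader", ["https://www.example.com"])
print(json.dumps(payload, indent=4))
```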

POST /job/create  -- Inject
{
    "args": {
        "url_dir": "/tmp/1475832312548-0"
    },
    "confId": "default",
    "crawlId": "crawl01",
    "type": "INJECT"
}




POST /job/create  -- Generate
{
    "type": "GENERATE",
    "confId": "default",
    "crawlId": "crawl01",
    "args": {
        "segments_dir": "/bin/crawl01/segments"
    }
}


POST /job/create  -- Fetch (segment_dir is the input path)
{
    "args": {
        "segment_dir": "/bin/crawl01/segments/20161007152133",
        "threads": "50"
    },
    "confId": "default",
    "crawlId": "crawl01",
    "type": "FETCH"
}



POST /job/create  -- Parse (segment_dir is the input path)
{
    "args": {
        "segment_dir": "/bin/crawl01/segments/20161007152133",
        "threads": "50"
    },
    "confId": "default",
    "crawlId": "crawl01",
    "type": "PARSE"
}


POST /job/create  -- UpdateDB (segment_dir is the full input path)
{
    "args": {
        "segment_dir": "/home/osboxes/trunk/runtime/local/bin/crawl01/segments/20161007152133"
    },
    "confId": "default",
    "crawlId": "crawl01",
    "type": "UPDATEDB"
}


POST /job/create  -- Index (segment_dir is the full input path)
{
    "args": {
        "segment_dir": "/home/osboxes/trunk/runtime/local/bin/crawl01/segments/20161007152133"
    },
    "confId": "default",
    "crawlId": "crawl01",
    "type": "INDEX"
}
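
The six job bodies share the same confId and crawlId and differ only in type
and args, so they could be generated in order. This is a sketch under
assumptions: job_payloads is a hypothetical helper, and it takes the segment
path as an input even though, as far as I understand, the timestamped segment
directory only exists after the Generate job has run, so in practice that
value would be filled in between steps.

```python
def job_payloads(crawl_id, conf_id, url_dir, segments_dir, segment_dir, threads=50):
    """Return the job/create bodies in the order used in the message."""

    def job(job_type, args):
        # Every job carries the same confId and crawlId.
        return {"type": job_type, "confId": conf_id, "crawlId": crawl_id, "args": args}

    return [
        job("INJECT", {"url_dir": url_dir}),
        job("GENERATE", {"segments_dir": segments_dir}),
        job("FETCH", {"segment_dir": segment_dir, "threads": str(threads)}),
        job("PARSE", {"segment_dir": segment_dir, "threads": str(threads)}),
        job("UPDATEDB", {"segment_dir": segment_dir}),
        job("INDEX", {"segment_dir": segment_dir}),
    ]


# Values copied from the message; the paths are specific to that machine.
jobs = job_payloads(
    crawl_id="crawl01",
    conf_id="default",
    url_dir="/tmp/1475832312548-0",
    segments_dir="/bin/crawl01/segments",
    segment_dir="/bin/crawl01/segments/20161007152133",
)
for body in jobs:
    print(body["type"], body["args"])
```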



Hope this helps. Let me know if you need any clarification.





-- 
Regards
Rajeev K. Srivastava

Re: Nutch 2.3.1

Posted by Sujen Shah <su...@gmail.com>.
Hi
You can find the REST API documentation for Nutch 1.x here
https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI and for
Nutch 2.X here
https://wiki.apache.org/nutch/NutchRESTAPI

I am in the process of reviewing it and updating anything that is
inconsistent; there have been changes in the Nutch 1.x REST service since
it's under active development.

It'd be great if you could give it a try and report any issues.

Thank you!

Regards,
Sujen Shah

On Oct 11, 2016 7:09 AM, "WebDawg" <we...@gmail.com> wrote:

> I would please very much like this.
>
> I was thinking about talking to the devs eventually, the documentation
> seems non existent.
>
> I suppose it is reading the source/working with that is there?
>
> On Mon, Oct 10, 2016 at 11:33 PM, MrSrivastavaRK .
> <sr...@gmail.com> wrote:
> > Hi,
> > I have successfully indexed content in Elasticsearch using Nutch 1.12
> REST
> > API. I can send you api details, If you want for reference.
> >
> > Regards
> > Rajeev
> >
> > On Oct 10, 2016 11:31 PM, "WebDawg" <we...@gmail.com> wrote:
> >
> >> Hello,
> >>
> >> I successfully have webapp and nutchserver running and I would like to
> >> know more about the API and if it is functional.
> >>
> >> I am trying to hack into it and wonder what the relationship between
> >> the different config urls and configs.
> >>
> >> Any help on this?  I would like to figure out how this works.  Does it
> >> reference all the files in the conf dir?  If I do a crawl is it the
> >> same as executing a crawl via command line?
> >>
>

Re: Nutch 2.3.1

Posted by WebDawg <we...@gmail.com>.
I would very much like this, please.

I was thinking about talking to the devs eventually; the documentation
seems nonexistent.

I suppose it comes down to reading the source and working with what is there?

On Mon, Oct 10, 2016 at 11:33 PM, MrSrivastavaRK .
<sr...@gmail.com> wrote:
> Hi,
> I have successfully indexed content in Elasticsearch using Nutch 1.12 REST
> API. I can send you api details, If you want for reference.
>
> Regards
> Rajeev
>
> On Oct 10, 2016 11:31 PM, "WebDawg" <we...@gmail.com> wrote:
>
>> Hello,
>>
>> I successfully have webapp and nutchserver running and I would like to
>> know more about the API and if it is functional.
>>
>> I am trying to hack into it and wonder what the relationship between
>> the different config urls and configs.
>>
>> Any help on this?  I would like to figure out how this works.  Does it
>> reference all the files in the conf dir?  If I do a crawl is it the
>> same as executing a crawl via command line?
>>

Re: Nutch 2.3.1

Posted by "MrSrivastavaRK ." <sr...@gmail.com>.
Hi,
I have successfully indexed content in Elasticsearch using the Nutch 1.12
REST API. I can send you the API details for reference if you want.

Regards
Rajeev

On Oct 10, 2016 11:31 PM, "WebDawg" <we...@gmail.com> wrote:

> Hello,
>
> I successfully have webapp and nutchserver running and I would like to
> know more about the API and if it is functional.
>
> I am trying to hack into it and wonder what the relationship between
> the different config urls and configs.
>
> Any help on this?  I would like to figure out how this works.  Does it
> reference all the files in the conf dir?  If I do a crawl is it the
> same as executing a crawl via command line?
>