
Posted to user@nutch.apache.org by Tony Colletti <TC...@minitab.com> on 2015/06/26 22:00:28 UTC

Nutch REST API field results

After searching your site and then having to resort to S/O, I've finally figured out how to run a full crawl by sending each command to the REST endpoint. However, after my final step (UPDATEDB) finishes, I check my db and many fields are missing. The ones I'm most concerned about are the "status" and "baseUrl" fields. I'm not even sure whether the crawl is actually being executed. I assume I have something wrong. I've followed the examples in this document <https://docs.google.com/document/d/1OGg22ATohapP2ycewIaTcUnENc2FeyYzni0ED_Jjxz8/edit> that I found in another mailing list topic. What am I doing wrong? I'm using Nutch 2.3 with MongoDB as my database.

Also, I've found that right after running just the INJECT command for the seed list, my db already has a new collection with information in it. That information is identical at the end of the crawl, so it never changes. Yet when I check the status of the other commands, they all report FINISHED and OK. What's going on?

Thanks for the help!

~ Tony

Re: Nutch REST API field results

Posted by "d.zenin" <br...@gmail.com>.

Hi Tony,

As I remember, each phase in Nutch (INJECT, GENERATE, ...) sets a specific
mark (marker field): for example, the INJECT phase sets "mk:_injmrk_" and the
GENERATE phase sets "mk:_gnmrk_". It is also worth pointing out that each
phase depends on the results of the previous phases (e.g. FETCH will only
fetch URLs that were successfully processed by the GENERATE phase, i.e. those
with the gen mark set). Check that the entries in your collection have these
marks. If an entry has only the inject mark, it means that the GENERATE phase
didn't select that URL to be fetched. In that case, check that you pass the
"curTime" parameter with the current timestamp after you did INJECT.
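
A rough sketch of both checks in Python, in case it helps. It only uses the
two marker names quoted above and the "curTime" argument. Everything else is
an assumption for illustration: the helper names, the "crawl01" crawl id, the
"type"/"crawlId"/"confId"/"args" layout of the job request body, and
"curTime" being epoch milliseconds. Please verify those against the Nutch 2.3
sources before relying on them.

```python
import json
import time

# Marker names as quoted in this thread ("mk:" is the prefix shown there).
INJECT_MARK = "_injmrk_"
GENERATE_MARK = "_gnmrk_"


def last_phase_mark(markers):
    """Return "GENERATE", "INJECT" or None, based on which marks are present.

    `markers` is a plain dict of marker name -> value that you have already
    read from one entry in your collection (hypothetical helper input).
    """
    # Accept keys with or without the "mk:" prefix.
    names = {key.split(":", 1)[-1] for key in markers}
    if GENERATE_MARK in names:
        return "GENERATE"
    if INJECT_MARK in names:
        return "INJECT"
    return None


def build_generate_job(crawl_id, conf_id="default", cur_time_ms=None):
    """Build a request body for a GENERATE job with "curTime" set explicitly.

    The body layout is an assumption, not a confirmed Nutch API contract.
    """
    if cur_time_ms is None:
        # Assumption: curTime is expected as epoch milliseconds.
        cur_time_ms = int(time.time() * 1000)
    return {
        "type": "GENERATE",
        "crawlId": crawl_id,
        "confId": conf_id,
        "args": {"curTime": cur_time_ms},
    }


if __name__ == "__main__":
    # An entry that was injected but never selected by GENERATE.
    print(last_phase_mark({"mk:_injmrk_": "y"}))
    # Body to send to the REST endpoint when creating the GENERATE job.
    print(json.dumps(build_generate_job("crawl01"), indent=2))
```

If last_phase_mark() returns "INJECT" for all of your entries after the whole
cycle has run, GENERATE never selected them, and the "curTime" value is the
first thing to look at.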

From my experience, it is better to download the Nutch sources and check
in the code what each phase is actually doing.

Hope that helps

Best Regards,
Dzmitry
