Posted to user@nutch.apache.org by lewis john mcgibbney <le...@apache.org> on 2016/10/18 06:27:23 UTC

Re: How to run nutch server on distributed environment

Hi Sachin,
A very late response, I know, but hopefully better late than never. Response
below.

On Fri, Sep 30, 2016 at 5:04 AM, <us...@nutch.apache.org> wrote:

>
> From: Sachin Shaju <sa...@mstack.com>
> To: user@nutch.apache.org
> Cc:
> Date: Thu, 29 Sep 2016 14:01:13 +0530
> Subject: How to run nutch server on distributed environment
> Hi,
>
> I have tested running of nutch in server mode by starting it using
> bin/nutch startserver command*locally*. Now I wonder whether I can start
> nutch in *server mode* on top of a hadoop cluster(in distributed
> environment) and submit crawl requests to server using nutch REST api ?
> Please help.
>
>
I am assuming you are running the Nutch master branch (as the command is
'startserver').
The answer is yes. As long as your YARN cluster is running well and its
memory settings suit your crawl datasets, you should be fine. If I were you
I would spend a bit of time running test crawls with various fetch list and
batch sizes, to make sure that you have no memory issues and that your
containers are not being killed by the ApplicationMaster.

On the Nutch side, please note that right now, when you POST seed list(s),
they are cached in /var/something/something on the local filesystem of the
server running the Nutch server, NOT on HDFS. This means that you somehow
need to get them onto HDFS before you can use your seed list in the INJECT
job's url_dir parameter.
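
As a rough illustration of that extra step, here is a minimal Python sketch.
It only builds the Hadoop copy command and the INJECT job body and does not
contact any server. The paths and crawl id are hypothetical placeholders, and
the job field names (type, confId, crawlId, args) are my understanding of the
REST API documentation, so please verify them against your Nutch version.

```python
import json


def hdfs_put_command(local_seed_dir, hdfs_seed_dir):
    """Build the standard Hadoop CLI command that copies a local
    seed directory onto HDFS. Run it yourself, e.g. with subprocess.run."""
    return ["hdfs", "dfs", "-put", local_seed_dir, hdfs_seed_dir]


def inject_job_payload(hdfs_seed_dir, crawl_id, conf_id="default"):
    """Build a JSON-serialisable body for an INJECT job whose url_dir
    points at the HDFS copy of the seeds. Field names are an assumption
    based on the REST API wiki page."""
    return {
        "type": "INJECT",
        "confId": conf_id,
        "crawlId": crawl_id,
        "args": {"url_dir": hdfs_seed_dir},
    }


if __name__ == "__main__":
    # Hypothetical paths: substitute the directory the server reported
    # when you created the seed list, and your own HDFS target.
    local_dir = "/path/to/cached/seed-dir"
    hdfs_dir = "/user/nutch/seeds"
    print(" ".join(hdfs_put_command(local_dir, hdfs_dir)))
    print(json.dumps(inject_job_payload(hdfs_dir, "crawl01"), indent=2))
```

You would run the printed command on the machine hosting the Nutch server,
then POST the printed JSON to the server's job creation endpoint.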

If you need any help with this, simply consult the very helpful
documentation put together by Sujen at
https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI
Let us know how you get on, as the REST API is very handy indeed. It would
be nice to build it into deployment managers such as Ambari in the future.

Lewis