Posted to user@nutch.apache.org by Sachin Shaju <sa...@mstack.com> on 2016/10/04 13:18:48 UTC

Nutch as a service

Hi,
    I would like to know how the Nutch server actually works. Does it use a
listener for incoming crawl requests, or is it a continuously running
server?
Regards,
Sachin Shaju

sachin.s@mstack.com

-- 
 

The information contained in this electronic message and any attachments to 
this message are intended for the exclusive use of the addressee(s) and may 
contain proprietary, confidential or privileged information. If you are not 
the intended recipient, you should not disseminate, distribute or copy this 
e-mail. Please notify the sender immediately and destroy all copies of this 
message and any attachments.

WARNING: Computer viruses can be transmitted via email. The recipient 
should check this email and any attachments for the presence of viruses. 
The company accepts no liability for any damage caused by any virus 
transmitted by this email.

www.mStack.com

Re: Nutch as a service

Posted by Sachin Shaju <sa...@mstack.com>.

Hi Furkan,
             I've tried passing null for args. It didn't work either.
After investigating the source code of *Fetcher.java*, I've figured out that
it looks for the segment in a local path if no segment option is given. If
the segment option is set to a valid segment in HDFS, it works. I've
resolved the issue by returning the segment path from the generate phase in
the results JSON of the generate REST call. I added one or two lines to the
source code of *Generator.java* and it works. I am not sure whether this is
the right way to do it, but it works. Please write to me if there is a
better option.
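
A minimal sketch of the client side of this workaround, in Python: the segment path reported by the generate step is passed explicitly to the fetch job. The `segment` argument key and the HDFS path are assumptions for illustration; the message above only says that a segment option pointing at a valid HDFS segment makes the fetch work. The helper name `build_fetch_job` is hypothetical.

```python
import json


def build_fetch_job(conf_id, crawl_id, segment_path):
    """Build the JSON body for POST /job/create with an explicit segment.

    The "segment" key is an assumption; check the argument name that
    Fetcher.java reads in your Nutch version.
    """
    if not segment_path:
        raise ValueError("a segment path is required for the fetch job")
    return {
        "type": "FETCH",
        "confId": conf_id,
        "crawlId": crawl_id,
        "args": {"segment": segment_path},
    }


# Hypothetical segment path, standing in for the value returned by the
# patched generate call.
body = build_fetch_job(
    "news", "crawl001", "hdfs:///user/nutch/crawl001/segments/20161005185359"
)
print(json.dumps(body, indent=2))
```

Rejecting an empty segment path on the client keeps the failure close to its cause instead of surfacing later as an exception inside the job worker.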

Everything works until the index phase. Indexing to Elasticsearch fails with
an unknown exception. Please have a look at
http://www.mail-archive.com/user%40nutch.apache.org/msg15001.html

Regards,
Sachin Shaju

sachin.s@mstack.com

On Thu, Oct 6, 2016 at 10:12 PM, Furkan KAMACI <fu...@gmail.com>
wrote:

> Hi Sachin,
>
> Could you check it again, sending *null* instead of *{}*?
>
> Kind Regards,
> Furkan KAMACI

Re: Nutch as a service

Posted by Furkan KAMACI <fu...@gmail.com>.

Hi Sachin,

Could you check it again, sending *null* instead of *{}*?

Kind Regards,
Furkan KAMACI

On Thu, Oct 6, 2016 at 7:20 AM, Sachin Shaju <sa...@mstack.com> wrote:

> Hi Sujen,
>               Thanks for the reply. Actually, that Stack Overflow post was
> created by me. :) I have some more questions.
>  1. Do I have to run the server on the Hadoop namenode itself?
>  2. I have tested the Nutch server on Hadoop, but in the *fetch phase* it
> throws a *NullPointerException*. The log is below.
> 16/10/05 18:53:59 ERROR impl.JobWorker: Cannot run job worker!
>
> java.lang.NullPointerException
> at java.util.Arrays.sort(Arrays.java:1438)
> at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:564)
> at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:71)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> I've checked the source code. The cause is the missing segment parameter
> in the REST call for fetch. I expected it to pick the latest segment
> automatically, but it does not work that way.
>
> The request I used is:
>
> POST /job/create
> {
>     "type":"FETCH",
>     "confId":"news",
>     "crawlId":"crawl001",
>     "args": {}
> }
>
> Am I missing anything here?
>
>
>
>
> Regards,
> Sachin Shaju
>
> sachin.s@mstack.com
> +919539887554

Re: Nutch as a service

Posted by Sachin Shaju <sa...@mstack.com>.

Hi Sujen,
              Thanks for the reply. Actually, that Stack Overflow post was
created by me. :) I have some more questions.
 1. Do I have to run the server on the Hadoop namenode itself?
 2. I have tested the Nutch server on Hadoop, but in the *fetch phase* it
throws a *NullPointerException*. The log is below.
16/10/05 18:53:59 ERROR impl.JobWorker: Cannot run job worker!

java.lang.NullPointerException
at java.util.Arrays.sort(Arrays.java:1438)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:564)
at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:71)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I've checked the source code. The cause is the missing segment parameter
in the REST call for fetch. I expected it to pick the latest segment
automatically, but it does not work that way.
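
One likely explanation for the trace, assuming the fetcher lists a local segments directory when no segment is given: in Java, `File.listFiles()` returns `null` for a path that is not an existing directory, and passing `null` to `Arrays.sort` throws `NullPointerException`. The Python sketch below shows the "pick the latest segment" logic with that case guarded. The function name and directory layout are hypothetical, and it assumes segment directories are named by timestamp so that lexicographic order is chronological.

```python
import os
import tempfile


def latest_segment(segments_dir):
    """Return the path of the newest segment directory, or None if there is none."""
    if not os.path.isdir(segments_dir):
        # The failing case: the directory does not exist on the local filesystem.
        return None
    names = sorted(
        name
        for name in os.listdir(segments_dir)
        if os.path.isdir(os.path.join(segments_dir, name))
    )
    return os.path.join(segments_dir, names[-1]) if names else None


# Usage with a throwaway directory standing in for crawl001/segments.
with tempfile.TemporaryDirectory() as root:
    for name in ("20161004120000", "20161005185359"):
        os.mkdir(os.path.join(root, name))
    newest = latest_segment(root)
    missing = latest_segment(os.path.join(root, "no-such-dir"))

print(os.path.basename(newest), missing)
```

Note that this only inspects the local filesystem; a segment stored in HDFS would have to be looked up through the Hadoop filesystem API instead.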

The request I used is:

POST /job/create
{
    "type":"FETCH",
    "confId":"news",
    "crawlId":"crawl001",
    "args": {}
}

Am I missing anything here?




Regards,
Sachin Shaju

sachin.s@mstack.com
+919539887554

On Thu, Oct 6, 2016 at 5:03 AM, Sujen Shah <su...@gmail.com> wrote:

> Hi Sachin,
>
> The Nutch REST API is built using the Apache CXF framework and JAX-RS. The
> Nutch server uses an embedded Jetty server to service HTTP requests.
> You can find out more about CXF and Jetty here (
> http://cxf.apache.org/docs/overview.html).
>
> The server runs on one machine, waiting for HTTP requests. Once a request is
> received, it starts the requested Nutch job (which might be distributed,
> e.g. a fetch job).
>
> Just for visibility on the user list, this question was also asked on
> Stack Overflow. The question and follow-up discussion can be found at
> http://stackoverflow.com/questions/39853492/working-of-nutch-server-in-distributed-mode
>
> Thanks
> Sujen
>
>
>
> Regards,
> Sujen Shah
> M.S - Computer Science
> University of Southern California
> http://www.linkedin.com/in/sujenshah

Re: Nutch as a service

Posted by Sujen Shah <su...@gmail.com>.

Hi Sachin,

The Nutch REST API is built using the Apache CXF framework and JAX-RS. The
Nutch server uses an embedded Jetty server to service HTTP requests.
You can find out more about CXF and Jetty here (
http://cxf.apache.org/docs/overview.html).

The server runs on one machine, waiting for HTTP requests. Once a request is
received, it starts the requested Nutch job (which might be distributed,
e.g. a fetch job).
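
To illustrate the request model described above, here is a minimal Python sketch of a client preparing a job-creation request for the server. The host and port (`localhost:8081`) are assumptions about a local setup; the `/job/create` path and the JSON fields are taken from the request quoted elsewhere in this thread. The helper name `make_job_request` is hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request


def make_job_request(base_url, job):
    """Prepare (but do not send) a POST /job/create request with a JSON job description."""
    data = json.dumps(job).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/job/create",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


job = {"type": "FETCH", "confId": "news", "crawlId": "crawl001", "args": {}}
request = make_job_request("http://localhost:8081", job)  # assumed host and port
print(request.get_method(), request.full_url)

# To submit it against a running server:
#     with urllib.request.urlopen(request) as response:
#         print(response.read().decode("utf-8"))
```

Because the server stays up and waits for requests, a client can send one such request per crawl phase and leave the execution of each job to the server.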

Just for visibility on the user list, this question was also asked on
Stack Overflow. The question and follow-up discussion can be found at
http://stackoverflow.com/questions/39853492/working-of-nutch-server-in-distributed-mode

Thanks
Sujen

Regards,
Sujen Shah
M.S - Computer Science
University of Southern California
http://www.linkedin.com/in/sujenshah

On Tue, Oct 4, 2016 at 6:18 AM, Sachin Shaju <sa...@mstack.com> wrote:

> Hi,
>     I would like to know how the Nutch server actually works. Does it use a
> listener for incoming crawl requests, or is it a continuously running
> server?
> Regards,
> Sachin Shaju
>
> sachin.s@mstack.com