Posted to user@nutch.apache.org by Ankit Goel <an...@gmail.com> on 2015/07/23 02:51:38 UTC

Nutch on the cloud

Hi,
After my runs on my laptop, I'm ready to port my work to the cloud. I'm
planning to use Amazon. One thing I noticed when I started with Nutch is that
a lot of things are left unsaid on the site/wiki, and they took me a lot of
time to figure out. Pitfalls, if I may call them that. I don't really have
code or scripts, but I need Nutch to run all the time on the cloud.

So before I port to the cloud, are there any things I should beware of or
look out for? Is AWS fine with Nutch? Are there any configurations I
should remember? Any advice on implementation to ease my transition and run
Nutch 24 hours a day? I will be running a seed file and crawling the net in
general.
Thanks

-- 
Regards,
Ankit Goel
http://about.me/ankitgoel

Re: Nutch on the cloud

Posted by Ankit Goel <an...@gmail.com>.
Hi Chris,
My user name is AnkitGoel.
Glad to be able to contribute. Thanks.

P.S. I tried to send you an email and got an auto-response. Congrats, if I
may say so.

On Thu, Jul 23, 2015 at 8:47 PM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Yes that would be fantastic. How about a wiki page on getting up
> and running and overcoming problems with the most recent Nutch?
>
> The Nutch wiki is here:
>
> http://wiki.apache.org/nutch/
>
> Please sign up for an account and tell me your username. Then I’ll
> grant you permissions to edit the wiki.
>
> Thank you Ankit!
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


-- 
Regards,
Ankit Goel
http://about.me/ankitgoel

Re: Nutch on the cloud

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Yes that would be fantastic. How about a wiki page on getting up
and running and overcoming problems with the most recent Nutch?

The Nutch wiki is here:

http://wiki.apache.org/nutch/

Please sign up for an account and tell me your username. Then I’ll
grant you permissions to edit the wiki.

Thank you Ankit!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Ankit Goel <an...@gmail.com>
Reply-To: "user@nutch.apache.org" <us...@nutch.apache.org>
Date: Thursday, July 23, 2015 at 7:22 AM
To: "user@nutch.apache.org" <us...@nutch.apache.org>
Subject: Re: Nutch on the cloud

>Hey,
>@Chris, I would love to help with the wiki (honored, in fact), but my input
>is not about the getting-started process. It's more along the lines of
>frequent errors after that. For example, the redirect plugin doesn't work
>how you expect it to (not even with the latest one). Or sometimes the
>parsechecker will give results that a normal Nutch run won't, even though
>it's the same regex filter, or where to check it. Or which Solr you need to
>start with, because 5.x has a different file structure. Things like that, on
>which you spend a long time.
>
>If there is a wiki page for this, I will gladly step up to the plate. It
>isn't exactly an FAQ either. I was thinking I could blog about it, but I
>think your idea of a wiki would be better, so that it can be updated by later
>authors as the problems are fixed. So should I create one on the Nutch
>site? Also, many of the problems are asked about multiple times on the
>mailing list, and a Google search just doesn't cut it. So maybe a repository
>of frequent problems, something of that sort?
>Thanks for the heads-up on the other guide. It gave me a starting point.
>
>
>-- 
>Regards,
>Ankit Goel
>http://about.me/ankitgoel


Re: Nutch on the cloud

Posted by Ankit Goel <an...@gmail.com>.
Hey,
@Chris, I would love to help with the wiki (honored, in fact), but my input
is not about the getting-started process. It's more along the lines of
frequent errors after that. For example, the redirect plugin doesn't work
how you expect it to (not even with the latest one). Or sometimes the
parsechecker will give results that a normal Nutch run won't, even though
it's the same regex filter, or where to check it. Or which Solr you need to
start with, because 5.x has a different file structure. Things like that, on
which you spend a long time.

If there is a wiki page for this, I will gladly step up to the plate. It
isn't exactly an FAQ either. I was thinking I could blog about it, but I
think your idea of a wiki would be better, so that it can be updated by later
authors as the problems are fixed. So should I create one on the Nutch
site? Also, many of the problems are asked about multiple times on the
mailing list, and a Google search just doesn't cut it. So maybe a repository
of frequent problems, something of that sort?
Thanks for the heads-up on the other guide. It gave me a starting point.


On Thu, Jul 23, 2015 at 6:24 AM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Thanks, Ankit, for the honest feedback. Would you be willing to update
> our wiki and improve the instructions with the gotchas from your
> experience?
>
> We have a guide we have been working on ourselves for getting Nutch
> running and churning on Elastic MapReduce. That’s where I’d recommend
> starting.
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


-- 
Regards,
Ankit Goel
http://about.me/ankitgoel

Re: Nutch on the cloud

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Thanks, Ankit, for the honest feedback. Would you be willing to update
our wiki and improve the instructions with the gotchas from your
experience?

We have a guide we have been working on ourselves for getting Nutch
running and churning on Elastic MapReduce. That’s where I’d recommend
starting.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Ankit Goel <an...@gmail.com>
Reply-To: "user@nutch.apache.org" <us...@nutch.apache.org>
Date: Wednesday, July 22, 2015 at 5:51 PM
To: "user@nutch.apache.org" <us...@nutch.apache.org>
Subject: Nutch on the cloud

>Hi,
>After my runs on my laptop, I'm ready to port my work to the cloud. I'm
>planning to use Amazon. One thing I noticed when I started with Nutch is that
>a lot of things are left unsaid on the site/wiki, and they took me a lot of
>time to figure out. Pitfalls, if I may call them that. I don't really have
>code or scripts, but I need Nutch to run all the time on the cloud.
>
>So before I port to the cloud, are there any things I should beware of or
>look out for? Is AWS fine with Nutch? Are there any configurations I
>should remember? Any advice on implementation to ease my transition and run
>Nutch 24 hours a day? I will be running a seed file and crawling the net in
>general.
>Thanks
>
>-- 
>Regards,
>Ankit Goel
>http://about.me/ankitgoel