Posted to dev@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/08/22 22:56:49 UTC

The crawl command, keep or get rid of

Hi,

The crawl command seems to add a lot of confusion. It hides the entire crawl 
cycle logic from new users, leading to questions, lack of understanding of 
basic Nutch concepts, unsupported switches of the jobs it executes, more 
problems etc. I am quite an opponent of the crawl command and would also not 
recommend it to anyone including new users. A running Nutch almost always 
requires some scripting here and there, cron jobs, locks etc.

I propose (most likely a challenging statement) to deprecate the crawl command 
in 1.4.

Users, developers, please comment. 

Thanks

Re: The crawl command, keep or get rid of

Posted by Markus Jelsma <ma...@openindex.io>.
You're right: https://issues.apache.org/jira/browse/NUTCH-1087

On Tuesday 23 August 2011 13:24:27 Julien Nioche wrote:
> > What kind of shell script did you have in mind? The wiki already provides
> > some useful scripts. The tutorials on Nutch also show commands that can be
> > used in custom scripts.
> 
> That's exactly my point. There are various scripts in the wiki, based on
> different versions of Nutch and of variable quality (e.g. some won't work
> in distributed mode). Let's have one in the repository so that people
> stop reinventing the wheel or asking where to get one.
> Of course, most of the script will exemplify the commands from the wiki, so
> it will have good educational value as well as being useful.
> 
> Julien

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: The crawl command, keep or get rid of

Posted by Julien Nioche <li...@gmail.com>.
> What kind of shell script did you have in mind? The wiki already provides
> some useful scripts. The tutorials on Nutch also show commands that can be
> used in custom scripts.

That's exactly my point. There are various scripts in the wiki, based on
different versions of Nutch and of variable quality (e.g. some won't work
in distributed mode). Let's have one in the repository so that people
stop reinventing the wheel or asking where to get one.
Of course, most of the script will exemplify the commands from the wiki, so
it will have good educational value as well as being useful.

Julien


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: The crawl command, keep or get rid of

Posted by Eric Pugh <ep...@opensourceconnections.com>.
I wonder if the name "crawl" implies that the command is a sort of standard command, and all you would need? After all, if I were to sit down with a "crawler", it seems very logical that "crawl" would be how you run it! I like the simplicity of crawl from a "getting started" perspective. I agree, though, that I used it as a shortcut: I didn't want to learn all the lower-level concepts, I just wanted to crawl a couple of URLs and toss them into Solr. "crawl" and the example code did great!

Maybe instead of having "crawl" be a core part of running Nutch, it becomes "run-example-crawl.sh", with a caveat in the wiki that you should then look inside it and learn all the various steps.

Eric


On Aug 23, 2011, at 6:50 AM, Markus Jelsma wrote:

> What kind of shell script did you have in mind? The wiki already provides some 
> useful scripts. The tutorials on Nutch also show commands that can be used in 
> custom scripts.
> 
> Is an immediate crawl-with-one-command a desired feature? Provided as Java 
> code or shell script?

-----------------------------------------------------
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com
Co-Author: Solr 1.4 Enterprise Search Server available from http://www.packtpub.com/solr-1-4-enterprise-search-server

Re: The crawl command, keep or get rid of

Posted by Markus Jelsma <ma...@openindex.io>.
What kind of shell script did you have in mind? The wiki already provides some 
useful scripts. The tutorials on Nutch also show commands that can be used in 
custom scripts.

Is an immediate crawl-with-one-command a desired feature? Provided as Java 
code or shell script?

On Tuesday 23 August 2011 10:12:57 Julien Nioche wrote:
> +1 let's replace it with a shell script instead.

Re: The crawl command, keep or get rid of

Posted by Julien Nioche <li...@gmail.com>.
+1 let's replace it with a shell script instead.


Re: The crawl command, keep or get rid of

Posted by Radim Kolar <hs...@sendmail.cz>.
I agree. Nuke the crawl command.