Posted to dev@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/08/22 22:56:49 UTC
The crawl command, keep or get rid of
Hi,
The crawl command seems to add a lot of confusion. It hides the entire crawl
cycle logic from new users, leading to questions, lack of understanding of
basic Nutch concepts, unsupported switches of the jobs it executes, more
problems etc. I am quite an opponent of the crawl command and would also not
recommend it to anyone including new users. A running Nutch almost always
requires some scripting here and there, cron jobs, locks etc.
I propose (most likely a challenging statement) to deprecate the crawl command
in 1.4.
Users, developers, please comment.
Thanks
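For readers new to the cycle that the crawl command hides, one round can be sketched as a shell script along the following lines. The step names (inject, generate, fetch, parse, updatedb, invertlinks) are bin/nutch commands in Nutch 1.x; the directory layout, the variable names, the dry-run default and the mkdir-based lock are assumptions made for illustration only, and indexing is left out to keep the sketch short.

```shell
#!/bin/sh
# Sketch of one Nutch 1.x crawl round guarded by a simple lock.
# All paths and variable names below are hypothetical defaults.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"
SEED_DIR="${SEED_DIR:-urls}"
TOPN="${TOPN:-1000}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them
LOCK_DIR="${LOCK_DIR:-${TMPDIR:-/tmp}/nutch-crawl.lock}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Seed the crawldb; normally done once before the first round.
inject_seeds() {
  run "$NUTCH" inject "$CRAWL_DIR/crawldb" "$SEED_DIR"
}

# One generate / fetch / parse / updatedb / invertlinks round.
crawl_round() {
  run "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" -topN "$TOPN" || return 1
  if [ "$DRY_RUN" = "1" ]; then
    segment="$CRAWL_DIR/segments/SEGMENT"   # placeholder name for dry runs
  else
    # Segment directories are named by timestamp, so the last one is the newest.
    segment=$(ls -d "$CRAWL_DIR"/segments/* | tail -n 1)
  fi
  run "$NUTCH" fetch "$segment" || return 1
  # Separate parse step; assumes the fetcher is not configured to parse.
  run "$NUTCH" parse "$segment" || return 1
  run "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$segment" || return 1
  run "$NUTCH" invertlinks "$CRAWL_DIR/linkdb" -dir "$CRAWL_DIR/segments"
}

# Run a command while holding the lock; mkdir is atomic, so a second
# invocation (e.g. an overlapping cron job) fails instead of running.
with_lock() {
  if ! mkdir "$LOCK_DIR" 2>/dev/null; then
    echo "another crawl seems to be running ($LOCK_DIR exists)" >&2
    return 1
  fi
  status=0
  "$@" || status=$?
  rmdir "$LOCK_DIR"
  return "$status"
}

with_lock crawl_round
```

With the defaults the script only prints the five commands of a round; setting DRY_RUN=0 from a cron job would execute them, and calling inject_seeds first would seed a new crawldb.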
Re: The crawl command, keep or get rid of
Posted by Markus Jelsma <ma...@openindex.io>.
You're right: https://issues.apache.org/jira/browse/NUTCH-1087
On Tuesday 23 August 2011 13:24:27 Julien Nioche wrote:
> > What kind of shell script did you have in mind? The wiki already provides
> > some useful scripts. The tutorials on Nutch also show commands that can be
> > used in custom scripts.
>
> That's exactly my point. There are various scripts in the wiki, based on
> different versions of Nutch and of variable quality (e.g. some won't work
> in distributed mode) etc... Let's have one in the repository so that people
> stop reinventing the wheel or ask where to get one.
> Of course most of the script will exemplify the commands from the Wiki and
> it will have a good educational value as well as being useful.
>
> Julien
>
> > Is an immediate crawl-with-one-command a desired feature? Provided as
> > Java code or shell script?
> >
> > On Tuesday 23 August 2011 10:12:57 Julien Nioche wrote:
> > > +1 let's replace it with a shell script instead.
> > >
> > > On 22 August 2011 21:56, Markus Jelsma <ma...@openindex.io> wrote:
> > > > Hi,
> > > >
> > > > The crawl command seems to add a lot of confusion. It hides the
> > > > entire crawl cycle logic from new users, leading to questions, lack of
> > > > understanding of basic Nutch concepts, unsupported switches of the
> > > > jobs it executes, more problems etc. I am quite an opponent of the
> > > > crawl command and would also not recommend it to anyone including new
> > > > users. A running Nutch almost always requires some scripting here and
> > > > there, cron jobs, locks etc.
> > > >
> > > > I propose (most likely a challenging statement) to deprecate the
> > > > crawl command in 1.4.
> > > >
> > > > Users, developers, please comment.
> > > >
> > > > Thanks
> >
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: The crawl command, keep or get rid of
Posted by Julien Nioche <li...@gmail.com>.
> What kind of shell script did you have in mind? The wiki already provides
> some useful scripts. The tutorials on Nutch also show commands that can be
> used in custom scripts.
>
That's exactly my point. There are various scripts in the wiki, based on
different versions of Nutch and of variable quality (e.g. some won't work
in distributed mode) etc... Let's have one in the repository so that people
stop reinventing the wheel or ask where to get one.
Of course most of the script will exemplify the commands from the Wiki and
it will have a good educational value as well as being useful.
Julien
> Is an immediate crawl-with-one-command a desired feature? Provided as Java
> code or shell script?
>
> On Tuesday 23 August 2011 10:12:57 Julien Nioche wrote:
> > +1 let's replace it with a shell script instead.
> >
> > On 22 August 2011 21:56, Markus Jelsma <ma...@openindex.io> wrote:
> > > Hi,
> > >
> > > The crawl command seems to add a lot of confusion. It hides the entire
> > > crawl cycle logic from new users, leading to questions, lack of
> > > understanding of basic Nutch concepts, unsupported switches of the jobs
> > > it executes, more problems etc. I am quite an opponent of the crawl
> > > command and would also not recommend it to anyone including new users.
> > > A running Nutch almost always requires some scripting here and there,
> > > cron jobs, locks etc.
> > >
> > > I propose (most likely a challenging statement) to deprecate the crawl
> > > command in 1.4.
> > >
> > > Users, developers, please comment.
> > >
> > > Thanks
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
Re: The crawl command, keep or get rid of
Posted by Eric Pugh <ep...@opensourceconnections.com>.
I wonder if the name "crawl" implies that the command is a sort of standard command, and all you would need? After all, if I were to sit down with a "crawler", it seems very logical that "crawl" would be how you run it! I like the simplicity of crawl from a "getting started" approach. I agree, though, that I used it as a shortcut... I didn't want to learn all the lower-level concepts, I just wanted to crawl a couple of URLs and toss them into Solr. "crawl" and the example code did great!
Maybe instead of having "crawl" be a core part of running Nutch, it's "run-example-crawl.sh", and in the Wiki it's caveated that you should then look inside it and learn all the various steps.
Eric
On Aug 23, 2011, at 6:50 AM, Markus Jelsma wrote:
> What kind of shell script did you have in mind? The wiki already provides some
> useful scripts. The tutorials on Nutch also show commands that can be used in
> custom scripts.
>
> Is an immediate crawl-with-one-command a desired feature? Provided as Java
> code or shell script?
>
> On Tuesday 23 August 2011 10:12:57 Julien Nioche wrote:
>> +1 let's replace it with a shell script instead.
>>
>> On 22 August 2011 21:56, Markus Jelsma <ma...@openindex.io> wrote:
>>> Hi,
>>>
>>> The crawl command seems to add a lot of confusion. It hides the entire
>>> crawl cycle logic from new users, leading to questions, lack of
>>> understanding of basic Nutch concepts, unsupported switches of the jobs it
>>> executes, more problems etc. I am quite an opponent of the crawl command
>>> and would also not recommend it to anyone including new users. A running
>>> Nutch almost always requires some scripting here and there, cron jobs,
>>> locks etc.
>>>
>>> I propose (most likely a challenging statement) to deprecate the crawl
>>> command in 1.4.
>>>
>>> Users, developers, please comment.
>>>
>>> Thanks
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
-----------------------------------------------------
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com
Co-Author: Solr 1.4 Enterprise Search Server available from http://www.packtpub.com/solr-1-4-enterprise-search-server
Re: The crawl command, keep or get rid of
Posted by Markus Jelsma <ma...@openindex.io>.
What kind of shell script did you have in mind? The wiki already provides some
useful scripts. The tutorials on Nutch also show commands that can be used in
custom scripts.
Is an immediate crawl-with-one-command a desired feature? Provided as Java
code or shell script?
On Tuesday 23 August 2011 10:12:57 Julien Nioche wrote:
> +1 let's replace it with a shell script instead.
>
> On 22 August 2011 21:56, Markus Jelsma <ma...@openindex.io> wrote:
> > Hi,
> >
> > The crawl command seems to add a lot of confusion. It hides the entire
> > crawl cycle logic from new users, leading to questions, lack of
> > understanding of basic Nutch concepts, unsupported switches of the jobs it
> > executes, more problems etc. I am quite an opponent of the crawl command
> > and would also not recommend it to anyone including new users. A running
> > Nutch almost always requires some scripting here and there, cron jobs,
> > locks etc.
> >
> > I propose (most likely a challenging statement) to deprecate the crawl
> > command in 1.4.
> >
> > Users, developers, please comment.
> >
> > Thanks
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: The crawl command, keep or get rid of
Posted by Julien Nioche <li...@gmail.com>.
+1 let's replace it with a shell script instead.
On 22 August 2011 21:56, Markus Jelsma <ma...@openindex.io> wrote:
> Hi,
>
> The crawl command seems to add a lot of confusion. It hides the entire
> crawl cycle logic from new users, leading to questions, lack of
> understanding of basic Nutch concepts, unsupported switches of the jobs it
> executes, more problems etc. I am quite an opponent of the crawl command and
> would also not recommend it to anyone including new users. A running Nutch
> almost always requires some scripting here and there, cron jobs, locks etc.
>
> I propose (most likely a challenging statement) to deprecate the crawl
> command in 1.4.
>
> Users, developers, please comment.
>
> Thanks
>
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
Re: The crawl command, keep or get rid of
Posted by Radim Kolar <hs...@sendmail.cz>.
I agree. Nuke the crawl command.