Posted to dev@nutch.apache.org by kiran chitturi <ch...@gmail.com> on 2013/03/20 20:01:32 UTC

wiki update on Nutch Tutorial with crawl script

Hi!

I want to update the Nutch tutorials in the wiki to use the crawl script
(./bin/crawl). Because the tutorials still show the crawl command, users run
it and hit issues, and we end up suggesting that they use the crawl script
instead of the command.

Can we state uniformly across the wiki that the crawl command is deprecated
and that the crawl script is recommended?

Second, for a user running Nutch on a single node or in local mode, the
default size of topN (50,000) makes the crawl run for a long time. Can we make
the topN parameter configurable through the script?

Thank you,

-- 
Kiran Chitturi

<http://www.linkedin.com/in/kiranchitturi>

Re: wiki update on Nutch Tutorial with crawl script

Posted by Tejas Patil <te...@gmail.com>.
Phew.. I was about to ignore this one, as it was hidden among a lot of other
auto-generated emails for wiki updates!

On Wed, Mar 20, 2013 at 12:01 PM, kiran chitturi
<ch...@gmail.com> wrote:

> Hi!
>
> I want to update the Nutch tutorials in the wiki to use the crawl script
> (./bin/crawl). Because the tutorials still show the crawl command, users run
> it and hit issues, and we end up suggesting that they use the crawl script
> instead of the command.
>
> Can we state uniformly across the wiki that the crawl command is deprecated
> and that the crawl script is recommended?

Yes. The references to the crawl command must be replaced with the crawl
script in the tutorials.

> Second, for a user running Nutch on a single node or in local mode, the
> default size of topN (50,000) makes the crawl run for a long time. Can we
> make the topN parameter configurable through the script?

I think the crawl script has a lot of hard-coded values, so people can use it
to get started with a crawl without being bothered by the params, their
explanations, and the optimal values to set. The script says "MODIFY THE
PARAMETERS BELOW TO YOUR NEEDS", so anyone who wants to change these values
can modify it. I prefer keeping it as it is for now. Let's see what others
have to say about this.
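
For illustration, here is one way a script could keep a hard-coded default
and still let a user override topN without editing the file. This is only a
sketch under assumptions: the names TOP_N, DEFAULT_TOP_N, and resolve_top_n
are hypothetical and are not taken from bin/crawl. Only the 50,000 default
comes from this thread.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the actual bin/crawl source.
# TOP_N, DEFAULT_TOP_N and resolve_top_n are illustrative names.

DEFAULT_TOP_N=50000   # default topN mentioned in this thread

# Print the topN to use: the TOP_N environment variable when it is a
# positive integer without leading zeros, otherwise the default.
resolve_top_n() {
  candidate="${TOP_N:-$DEFAULT_TOP_N}"
  case "$candidate" in
    ''|*[!0-9]*|0*)
      # Empty, non-numeric, or zero-prefixed input: warn and fall back.
      echo "invalid TOP_N '$candidate', using $DEFAULT_TOP_N" >&2
      echo "$DEFAULT_TOP_N"
      ;;
    *)
      echo "$candidate"
      ;;
  esac
}
```

A script that adopted this pattern could then be started with a smaller value
for a local run, for example by setting TOP_N=1000 in the environment, while
users who set nothing would keep the existing default.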



Re: wiki update on Nutch Tutorial with crawl script

Posted by kiran chitturi <ch...@gmail.com>.
I have kept the crawl command but noted for users that it is deprecated.
I have added the crawl script in section 3.3 [0].

The wiki is now somewhat more up to date, and I hope basic questions from
Nutch users can be redirected to the relevant wiki pages.

*A few things still need to be updated:*
1. How to choose Nutch parameters for optimal configuration
2. A full tutorial for Nutch 2 with HBase. Notify users of current bugs
with MySQL and other stores.

Please add to this list if you feel any other section needs updating.

[0] - http://wiki.apache.org/nutch/NutchTutorial



-- 
Kiran Chitturi

<http://www.linkedin.com/in/kiranchitturi>

Re: wiki update on Nutch Tutorial with crawl script

Posted by kiran chitturi <ch...@gmail.com>.
Hi Feng, with this in mind I have created a wiki page for the crawl script
(bin/crawl) [0]. Please feel free to edit any of the wiki pages and update the
documentation.



[0] http://wiki.apache.org/nutch/bin/crawl


-- 
Kiran Chitturi

<http://www.linkedin.com/in/kiranchitturi>

Re: wiki update on Nutch Tutorial with crawl script

Posted by feng lu <am...@gmail.com>.
<<
Second, for a user running Nutch on a single node or in local mode, the
default size of topN (50,000) makes the crawl run for a long time. Can we make
the topN parameter configurable through the script?
>>

I tend to agree with Tejas that we should let users modify the parameters to
their needs. But we can add more detailed information to the bin/crawl wiki
page to tell users how to modify these parameters and what each of them
means.



-- 
Don't Grow Old, Grow Up... :-)