Posted to user@nutch.apache.org by A Laxmi <a....@gmail.com> on 2013/10/28 14:10:57 UTC

Nutch crawl nutch commands

Hi,

For Nutch 2.2.1, I am aware of two crawl commands/scripts that come out of
the box with Nutch:

(1) bin/nutch (step by step),
(2) bin/crawl (all in one)

I know how to specify a crawl ID for the `bin/crawl` command. How do I
specify a crawl ID for the `bin/nutch` command?

The reason I am asking is that I ran a large crawl job using the all-in-one
`bin/crawl` command with a crawl ID, and it broke while indexing in Solr
during the 9th crawl iteration. Now I just want to run the single
`bin/nutch solrindex` step for that interrupted 9th iteration to complete
the Solr indexing. How should I specify the crawl ID in the
`bin/nutch solrindex` command? What is the syntax?

All the crawl data is stored in an HBase table, "webpage_test".

Re: Nutch crawl nutch commands

Posted by A Laxmi <a....@gmail.com>.

Hey Talat!!

Is there any way I can specify the batch ID as well in the following command?

bin/nutch solrindex <solr url> -all -crawlId <crawl id>
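
Going by the usage string feng lu quotes in this thread,
`SolrIndexerJob <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]`,
the batch ID appears to take the place of `-all` rather than being passed
alongside it. A minimal sketch under that reading: the Solr URL, crawl ID and
batch ID are made-up placeholders, and `solrindex_cmd` is a hypothetical
helper that only builds the command string and does not run Nutch.

```shell
#!/bin/sh
# Placeholder values, not taken from this thread.
SOLR_URL="http://localhost:8983/solr"
CRAWL_ID="test"
BATCH_ID="1382970657-12345"   # hypothetical batch ID

# Hypothetical helper: prints the command line for one selector.
# $1 is a batch ID, -all, or -reindex (the three alternatives in the
# usage string). Nothing is executed.
solrindex_cmd() {
  printf '%s\n' "bin/nutch solrindex $SOLR_URL $1 -crawlId $CRAWL_ID"
}

solrindex_cmd "$BATCH_ID"   # selector is a single batch ID
solrindex_cmd -all          # selector is -all (no batch restriction)
```

Whether the job actually honours the batch ID and crawl ID is the open
question in this thread, so treat this as the documented syntax rather than
confirmed behaviour.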


Re: Nutch crawl nutch commands

Posted by Talat UYARER <ta...@agmlab.com>.

That is right, Laxmi. There is no standalone SolrIndexerJob command :)
You can run SolrIndexerJob through the nutch shell script. Maybe you can use
it like this:

bin/nutch solrindex <solr url> -all -crawlId <crawl id>

Talat
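
As a rough sketch, here is the command above with the placeholders filled in.
The Solr URL and the crawl ID `test` are made-up values, not taken from this
thread, and `run` is a hypothetical dry-run wrapper that only prints the
command unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Made-up values; substitute your own Solr URL and crawl ID.
SOLR_URL="http://localhost:8983/solr"
CRAWL_ID="test"

# Dry-run helper (not part of Nutch): prints the command by default and
# executes it only when DRY_RUN=0 is set in the environment.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Index with -all, scoped to the given crawl ID.
run bin/nutch solrindex "$SOLR_URL" -all -crawlId "$CRAWL_ID"
```

Run it from the Nutch runtime directory with `DRY_RUN=0` once the printed
command looks right.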

Re: Nutch crawl nutch commands

Posted by A Laxmi <a....@gmail.com>.

It says "SolrIndexerJob: command not found" when I follow this syntax:

SolrIndexerJob <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]

Re: Nutch crawl nutch commands

Posted by feng lu <am...@gmail.com>.

Hi Laxmi

I checked the code in the bin/crawl script:

echo "Indexing $CRAWL_ID on SOLR index -> $SOLRURL"
  $bin/nutch solrindex $commonOptions $SOLRURL -all -crawlId $CRAWL_ID

If what you say is correct, then that script will also ignore the batch ID
and crawl ID.

You can try a small test db and run the bin/nutch script step by step.
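
A rough sketch of what running the steps one at a time on a small test crawl
might look like. The seed directory `urls`, the crawl ID `smalltest`, the
`-topN 50` limit and the Solr URL are made-up values. The per-step flags are
assumed from the Nutch 2.x usage strings and are not confirmed in this
thread, so check each subcommand's own usage output before relying on them.
`run` is a hypothetical dry-run wrapper that only prints each command unless
`DRY_RUN=0` is set.

```shell
#!/bin/sh
# Made-up values for a small test crawl.
SEED_DIR="urls"
CRAWL_ID="smalltest"
SOLR_URL="http://localhost:8983/solr"

# Dry-run helper (not part of Nutch): prints the command by default and
# executes it only when DRY_RUN=0 is set in the environment.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# One crawl iteration, step by step. Flags are assumptions; the chain
# stops at the first step that fails.
crawl_steps() {
  run bin/nutch inject "$SEED_DIR" -crawlId "$CRAWL_ID" &&
  run bin/nutch generate -topN 50 -crawlId "$CRAWL_ID" &&
  run bin/nutch fetch -all -crawlId "$CRAWL_ID" &&
  run bin/nutch parse -all -crawlId "$CRAWL_ID" &&
  run bin/nutch updatedb -crawlId "$CRAWL_ID" &&
  run bin/nutch solrindex "$SOLR_URL" -all -crawlId "$CRAWL_ID"
}

crawl_steps
```

Using a separate crawl ID for the test keeps it apart from the large crawl,
which makes it easier to see whether the crawl ID is being picked up at each
step.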


-- 
Don't Grow Old, Grow Up... :-)

Re: Nutch crawl nutch commands

Posted by A Laxmi <a....@gmail.com>.

Hi feng -

I tried, but it's ignoring the batch ID and crawl ID for some reason.

Re: Nutch crawl nutch commands

Posted by feng lu <am...@gmail.com>.

Hi

Please check the usage of the solrindex command:

$ bin/nutch solrindex
Usage: SolrIndexerJob <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]


-- 
Don't Grow Old, Grow Up... :-)