
Posted to user@nutch.apache.org by Shane Wood <sh...@cbm8bit.com> on 2014/04/03 04:52:06 UTC

One site only index.

I have indexed several sites successfully.
Now I wish to index a new site without updating any of the other sites
already indexed.

I use Nutch 2.21, MySQL 5.3 and Solr 4.7.0. How would you recommend I go
about indexing only the new site?
If someone can give examples of command lines, that would be amazingly
helpful.

Cheers
Shane.

Re: One site only index.

Posted by Bayu Widyasanyata <bw...@gmail.com>.

Hi Shane,

The regex-urlfilter.txt will exclude "someurl.com" when you run one or
more cycles of the "inject > generate > fetch > parse > update >
solrupdate" process. The regex-urlfilter.txt also affects the "updatedb"
and "solrindex" steps when the "-filter" parameter is applied.
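
To illustrate how such a filter narrows a crawl, here is a minimal Python
sketch of first-match rule evaluation as I understand regex-urlfilter.txt
to work: each rule is "+" or "-" followed by a regex, the first rule whose
regex is found in the URL decides, and a URL that matches no rule is
dropped. The host someurl.com comes from this thread; the sample rules and
the names parse_rules and accepts are hypothetical. Nutch itself uses Java
regexes, so treat this as an approximation and not as Nutch's own code.

```python
import re

# Hypothetical rules in regex-urlfilter.txt style: '+' accepts, '-' rejects.
RULES_TEXT = r"""
# skip common binary/image suffixes
-\.(gif|jpg|png|css|js|zip)$
# accept only someurl.com and its subdomains
+^https?://([a-z0-9-]+\.)*someurl\.com/
# reject everything else
-.
"""

def parse_rules(text):
    """Return a list of (accept, compiled_regex) pairs, skipping blanks and comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(url, rules):
    """The first rule whose regex is found in the URL decides; no match means reject."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False

rules = parse_rules(RULES_TEXT)
for url in ("http://www.someurl.com/page.html",
            "http://www.someurl.com/logo.png",
            "http://other.example.org/"):
    print(url, "->", "keep" if accepts(url, rules) else "drop")
```

Running candidate URLs through a check like this before a crawl cycle is a
cheap way to confirm that only the new site would pass the filter.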

Regards,



-- 
wassalam,
[bayu]

Re: One site only index.

Posted by Shane Wood <sh...@cbm8bit.com>.

Can you choose a custom regex-urlfilter.txt, to save editing it each
time you wish to index a different site?

I am surprised you can't enter a URL when generating a fetch list, i.e.
something like:

/bin/nutch generate --only someurl.com --job 192833-292837

Then you fetch job 192833-292837, parse job 192833-292837, and finally
update the database for job 192833-292837.

Now that would be great.

Thanks, will be doing it your way for now. :)

Shane.



Re: One site only index.

Posted by remi tassing <ta...@gmail.com>.

Hi Shane,

You could use the same scripts as before but just modify the
regex-urlfilter.txt to restrict the crawling scope.
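
As a rough illustration of restricting the scope to one site, the Python
sketch below generates filter rules that admit a single host and reject
everything else, so the file does not have to be edited by hand for each
new site. The helper names single_site_filter and write_filter and the
per-site file naming are hypothetical, and someurl.com is only a
placeholder. I believe the urlfilter.regex.file property selects which
file the regex filter plugin reads, but please check that against the
nutch-default.xml of your version.

```python
import re
from pathlib import Path

def single_site_filter(host):
    """Build regex-urlfilter.txt-style rules that admit only `host` and its subdomains."""
    escaped = re.escape(host)
    return "\n".join([
        f"# generated: restrict the crawl to {host}",
        f"+^https?://([a-z0-9-]+\\.)*{escaped}/",
        "# reject every other URL",
        "-.",
        "",
    ])

def write_filter(host, directory="."):
    """Write the rules to a per-site file (hypothetical naming scheme) and return its path."""
    path = Path(directory) / f"regex-urlfilter-{host}.txt"
    path.write_text(single_site_filter(host), encoding="utf-8")
    return path

if __name__ == "__main__":
    print(single_site_filter("someurl.com"))
```

The generated file could then be copied over regex-urlfilter.txt (or
pointed to from the configuration) before running the usual cycle for the
new site.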

BR, Remi
