Posted to user@nutch.apache.org by Ian Piper <ia...@tellura.co.uk> on 2012/08/03 07:43:59 UTC

Re: Why won't my crawl ignore these urls? [SOLVED]

Hi,

Thanks very much for the suggestions, particularly from AC Nutch. You were correct on both counts: my regular expressions were unescaped in places, and there was a catch-all include at the top of the file. The latter was the crucial mistake - I hadn't got it straight in my head that processing in this file stops at the first matching rule, so anything below that catch-all was never being evaluated anyway!
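
For the record, here is roughly what the relevant part of conf/regex-urlfilter.txt looks like now - a sketch based on AC's suggested patterns, with comments of my own:

# exclude the faceted resource pages first
-^http://.*\.elaweb\.org\.uk/resources/type\..*
-^http://.*\.elaweb\.org\.uk/resources/topic\..*
# rules are applied top to bottom and the first matching rule wins,
# so the catch-all include has to come last
+.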

The indexing is still a little problematic, but I'm a lot further forward now.

Thanks again for all of the suggestions.


Ian.
--


On 31 Jul 2012, at 20:22, AC Nutch wrote:

> A couple of things I could think of are:
> 
> (1) Make sure those regex excludes aren't below a "catch-all" include. If, for example, you had "+." right above them in the regex-urlfilter file, it is my understanding that Nutch stops at the first matching rule, so it would index those URLs anyway.
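> 
> For example, if the file were ordered like this, both excludes would be
> dead rules - every URL would already have matched "+." before reaching
> them:
> 
> +.
> -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*
> -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*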
> 
> (2) I know everyone keeps saying this, but make sure the regexes are correct. One thing I noticed is that your dots are not escaped. I would start with something more general and then narrow it down, or use an online regex validation tool. If you're feeling lazy, try the following:
> 
> -^http://.*\.elaweb\.org\.uk/resources/type\..*
> -^http://.*\.elaweb\.org\.uk/resources/topic\..*
> 
> It's a little more general and harder to screw up ;-) If that's not acceptable for your purposes, let us know - I'm sure someone could help with the specific regexes.
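> 
> If you want to sanity-check a rule before re-crawling, a quick throwaway
> test can help. As far as I know the regex-urlfilter plugin matches URLs
> with standard Java regexes, so something like the following (the class
> name and test URL are just made up for illustration) will show whether a
> rule fires on a given URL:
> 
> import java.util.regex.Pattern;
> 
> public class RegexCheck {
>     public static void main(String[] args) {
>         // the suggested exclude patterns, without the leading "-"
>         String[] rules = {
>             "^http://.*\\.elaweb\\.org\\.uk/resources/type\\..*",
>             "^http://.*\\.elaweb\\.org\\.uk/resources/topic\\..*"
>         };
>         // a URL like the ones in your fetch log that should be excluded
>         String url = "http://www.elaweb.org.uk/resources/topic.aspx?topic=10";
>         for (String rule : rules) {
>             boolean hit = Pattern.compile(rule).matcher(url).find();
>             System.out.println((hit ? "match:    " : "no match: ") + rule);
>         }
>     }
> }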
> 
> 
> 
> On Mon, Jul 30, 2012 at 12:24 PM, Ian Piper <ia...@tellura.co.uk> wrote:
> Hi all,
> 
> I have been trying to get to the bottom of this problem for ages and cannot resolve it - you're my last hope, Obi-Wan...
> 
> I have a job that crawls a client's site. I want to exclude URLs that look like this:
> 
> http://[clientsite.net]/resources/type.aspx?type=[whatever]
> 
> and
> 
> http://[clientsite.net]/resources/topic.aspx?topic=[whatever]
> 
> 
> To achieve this I thought I could put this into conf/regex-urlfilter.txt:
> 
> [...]
> -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*
> -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*
> [...]
> 
> Yet when I next run the crawl I see things like this:
> 
> fetching http://[clientsite.net]/resources/topic.aspx?topic=10
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=37
> [...]
> fetching http://[clientsite.net]/resources/type.aspx?type=2
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=36
> [...]
> 
> and the corresponding pages seem to appear in the final Solr index. So clearly they are not being excluded.
> 
> Is anyone able to explain what I have missed? Any guidance much appreciated.
> 
> Thanks,
> 
> 
> Ian.
> -- 
> Dr Ian Piper
> Tellura Information Services - the web, document and information people
> Registered in England and Wales: 5076715, VAT Number: 874 2060 29
> http://www.tellura.co.uk/
> Creator of monickr: http://monickr.com
> 01926 813736 | 07973 156616

-- 
Dr Ian Piper
Tellura Information Services - the web, document and information people
Registered in England and Wales: 5076715, VAT Number: 874 2060 29
http://www.tellura.co.uk/
Creator of monickr: http://monickr.com
01926 813736 | 07973 156616




Re: Why won't my crawl ignore these urls? [SOLVED]

Posted by Alejandro Caceres <ac...@hyperiongray.com>.
Glad to help and good luck!
