Posted to user@nutch.apache.org by Amarnatha Reddy <po...@gmail.com> on 2018/10/03 13:23:20 UTC

Regex to block some patterns

Hi Team,



I need some assistance to block patterns in my current setup.



My seed URL is always *https://www.abc.com/*, and I need to crawl all pages
except those matching the patterns below, using Nutch 1.15.


Patterns to block: *modal(.*).html*, *exit.html?* and *exit.html/?*

Sample pages: *modal.html*, *modal_1123Abc.html*, *modalaa_12.html* (these can
appear at the end of the URL)



Below are a few example URLs:


https://www.abc.com/abc-editions/2018/test-ask/altitude/feature-pillar/abc/acb-1/modal.html

https://www.abc.com/2017/ask/exterior/feature_overlay/modalcontainer5.html

https://www.abc.com/2017/image/exterior/abc/feature_overlay/modalcontainer5_Ab_c.html
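
The modal requirement can be sanity-checked outside Nutch before touching the filter file. The sketch below is only an approximation: it uses Python's re.search as a stand-in for the partial-match behaviour that Nutch's regex URL filter is understood to apply with Java regexes, and the candidate pattern and the names MODAL_RULE and is_modal_page are assumptions made for illustration, not a tested Nutch rule.

```python
import re

# Hypothetical candidate rule (an assumption, not a verified Nutch config):
# the last path segment starts with "modal" and ends in ".html".
MODAL_RULE = re.compile(r"/modal[^/]*\.html$", re.IGNORECASE)


def is_modal_page(url: str) -> bool:
    """Return True when the rule matches somewhere in the URL."""
    return MODAL_RULE.search(url) is not None


# Sample URLs taken from the message above.
modal_samples = [
    "https://www.abc.com/abc-editions/2018/test-ask/altitude/feature-pillar/abc/acb-1/modal.html",
    "https://www.abc.com/2017/ask/exterior/feature_overlay/modalcontainer5.html",
    "https://www.abc.com/2017/image/exterior/abc/feature_overlay/modalcontainer5_Ab_c.html",
]

for url in modal_samples:
    print(is_modal_page(url), url)
```

If the check behaves as intended, the corresponding exclusion line in regex-urlfilter.txt would presumably be `-(?i)/modal[^/]*\.html$`, placed above the final catch-all accept rule, since the first matching rule decides.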



exit.html (including variants such as exit.html? and exit.html/?)


The requirement: if the path directly after the domain (https://www.abc.com/)
starts with exit.html, exit.html? or exit.html/?, the URL must be excluded
from the crawl.

 https://www.abc.com/exit.html?url=https://www.gear.abc.com/welcome.asp

https://www.abc.com/exit.html/?tname=abc_facebook&url=http://www.facebook.com/abc&message=true
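
The exit.html requirement can be checked the same way. This is a minimal sketch under stated assumptions: Python's re.search stands in for the Java regex matching that Nutch is understood to use, and the candidate pattern and the names EXIT_RULE and is_exit_page are hypothetical, chosen only to illustrate "exit.html directly after the domain, optionally followed by / and a query string".

```python
import re

# Hypothetical candidate rule (an assumption, not a verified Nutch config):
# "exit.html" directly after the domain, then an optional "/", then either
# a "?" that starts a query string or the end of the URL.
EXIT_RULE = re.compile(r"^https?://www\.abc\.com/exit\.html/?(\?|$)", re.IGNORECASE)


def is_exit_page(url: str) -> bool:
    """Return True when the URL is an exit.html redirect page."""
    return EXIT_RULE.search(url) is not None


# Sample URLs taken from the message above.
exit_samples = [
    "https://www.abc.com/exit.html?url=https://www.gear.abc.com/welcome.asp",
    "https://www.abc.com/exit.html/?tname=abc_facebook&url=http://www.facebook.com/abc&message=true",
]

for url in exit_samples:
    print(is_exit_page(url), url)
```

Anchoring the pattern to the domain keeps a deeper page such as /2017/exit.html crawlable; dropping the `^https?://www\.abc\.com` prefix would block exit.html at any depth instead.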


*Note: Yes, we could add a rule of the form -^(complete url) for each URL, but
we don't know how many there are, so we need a generic regex rule.*


I tried the pattern below, but it is not working:

## Blocking pattern ends with ####

-^(?i)\*(modal*|exit*).html
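
One possible reason the rule above never fires can be shown with a short experiment. The sketch below makes several assumptions: the leading "-" is treated as Nutch's exclude marker and not as part of the regex, Python's re.search stands in for Java regex matching, and the inline (?i) is moved into a flag because Python requires global flags at the start of the expression. The names ATTEMPT and attempt_matches are hypothetical.

```python
import re

# The attempted rule without its leading "-" and with (?i) expressed as a flag.
# "^\*" demands a literal asterisk as the first character of the URL, and
# "modal*" means "moda" followed by zero or more "l", not "modal<anything>".
ATTEMPT = re.compile(r"^\*(modal*|exit*).html", re.IGNORECASE)


def attempt_matches(url: str) -> bool:
    """Return True when the attempted rule matches the given string."""
    return ATTEMPT.search(url) is not None


# Real URLs start with "https", so the anchored "\*" can never match them.
print(attempt_matches("https://www.abc.com/2017/ask/exterior/feature_overlay/modalcontainer5.html"))
print(attempt_matches("https://www.abc.com/exit.html?url=https://www.gear.abc.com/welcome.asp"))

# Only a string that literally begins with "*" satisfies the rule.
print(attempt_matches("*modal.html"))
```

In regex syntax "*" is a quantifier, not a shell-style wildcard; "match anything" is usually written `.*`, or simply left out when the engine searches for a match anywhere in the URL.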



Kindly help me set up a regex to block these cases.



Thanks,

Amarnath




------------------------------

Thanks and Regards,

*Amarnath Polu*