Posted to user@nutch.apache.org by Amarnatha Reddy <po...@gmail.com> on 2018/10/03 13:23:20 UTC
Regex to block some patterns
Hi Team,
I need some assistance blocking certain URL patterns in my current setup.
My seed URL is always https://www.abc.com/ and I
need to crawl all pages except those matching the patterns below, using Nutch 1.15.
Patterns to block: modal(.*).html, exit.html? and exit.html/?
Sample pages: modal.html, modal_1123Abc.html, modalaa_12.html (these can
appear at the end of the URL)
Below are a few use-case URLs:
https://www.abc.com/abc-editions/2018/test-ask/altitude/feature-pillar/abc/acb-1/modal.html
https://www.abc.com/2017/ask/exterior/feature_overlay/modalcontainer5.html
https://www.abc.com/2017/image/exterior/abc/feature_overlay/modalcontainer5_Ab_c.html
exit.html (and anything similar, such as exit.html? or exit.html/?)
The requirement here: if the path directly after the domain (https://www.abc.com/) starts with
exit.html, exit.html? or exit.html/?, the URL must be blocked/excluded from the crawl.
https://www.abc.com/exit.html?url=https://www.gear.abc.com/welcome.asp
https://www.abc.com/exit.html/?tname=abc_facebook&url=http://www.facebook.com/abc&message=true
Note: Yes, we could list each URL directly as -^(complete url), but we don't know how many
there are, so we need a generic regex rule.
I tried the pattern below, but it is not working:
## Blocking pattern ends with ####
-^(?i)\*(modal*|exit*).html
Kindly help me set up a regex to block my use case.
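A possible direction, offered as a sketch rather than a verified fix: in the attempted rule, `\*` requires a literal asterisk at the start of the URL, and `modal*` means "moda" followed by any number of "l", so the rule cannot match the sample URLs. The Python snippet below simulates first-match-wins filtering with three candidate rules written in regex-urlfilter.txt style. The rule strings, the helper name `is_accepted`, and the extra sample URL ending in index.html are assumptions for illustration only. Nutch uses Java regular expressions; the candidates use only syntax that Python and Java share, but they have not been run in a real crawl.

```python
import re

# Candidate rules in regex-urlfilter.txt style (assumed, not tested in Nutch):
# a leading '-' excludes, a leading '+' includes, and the first match decides.
RULES = [
    # Last path segment starts with "modal" and ends with ".html".
    r"-(?i)/modal[^/]*\.html$",
    # Path directly after the domain starts with "exit.html"
    # (covers exit.html, exit.html?... and exit.html/?...).
    r"-(?i)^https?://www\.abc\.com/exit\.html",
    # Everything else on the seed domain is accepted.
    r"+^https?://www\.abc\.com/",
]


def is_accepted(url, rules=RULES):
    """Return True if the first rule whose regex is found in url is a '+' rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: treat the URL as rejected


if __name__ == "__main__":
    samples = [
        "https://www.abc.com/abc-editions/2018/test-ask/altitude/feature-pillar/abc/acb-1/modal.html",
        "https://www.abc.com/2017/ask/exterior/feature_overlay/modalcontainer5.html",
        "https://www.abc.com/2017/image/exterior/abc/feature_overlay/modalcontainer5_Ab_c.html",
        "https://www.abc.com/exit.html?url=https://www.gear.abc.com/welcome.asp",
        "https://www.abc.com/exit.html/?tname=abc_facebook&url=http://www.facebook.com/abc&message=true",
        "https://www.abc.com/2017/ask/exterior/index.html",  # hypothetical normal page
    ]
    for url in samples:
        print("accept" if is_accepted(url) else "block ", url)
```

If these behave as intended, the three rule strings could go into conf/regex-urlfilter.txt ahead of any broader + rule. That placement is an assumption and should be checked against the actual filter file in use.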
Thanks,
Amarnath
------------------------------
Thanks and Regards,
Amarnath Polu