Posted to user@nutch.apache.org by Venkata MR <Ve...@hcl.com> on 2018/12/01 13:45:19 UTC

URL filter rejecting the URLs

Hi Nutch Users,

I was trying to crawl the site (https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G, https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=L) with the filter pattern "+^https?://nseindia\.com\/live\_market\/dynaContent\/live\_analysis\/top\_gainers\_losers\.htm\?cat\=([GL])", but the URLs are being rejected.

I tried multiple options, but the URLs are rejected in all cases.

Any help here is appreciated, Thanks!

Thanks & Regards
Venkata MR
+91 98455 77125

::DISCLAIMER::
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The contents of this e-mail and any attachment(s) are confidential and intended for the named recipient(s) only. E-mail transmission is not guaranteed to be secure or error-free as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or may contain viruses in transmission. The e mail and its contents (with or without referred errors) shall therefore not attach any liability on the originator or HCL or its affiliates. Views or opinions, if any, presented in this email are solely those of the author and may not necessarily reflect the views or opinions of HCL or its affiliates. Any form of reproduction, dissemination, copying, disclosure, modification, distribution and / or publication of this message without the prior written consent of authorized representative of HCL is strictly prohibited. If you have received this email in error please delete it and notify the sender immediately. Before opening any email and/or attachments, please check them for viruses and other defects.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

RE: URL filter rejecting the URLs

Posted by Venkata MR <Ve...@hcl.com>.
Hi Sebastian,

Thanks for the response. I resolved the issue; the cause was the following rule in regex-urlfilter.txt, which rejects any URL containing "?" or "=":

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
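
To illustrate why that rule wins: the regex URL filter applies its rules from top to bottom, and the first rule whose pattern is found in the URL decides the outcome. The following is a minimal Python sketch of that behaviour, not Nutch's actual code. The helper first_match_filter is hypothetical, and Python's re module stands in for the Java regex engine that Nutch uses.

```python
import re

def first_match_filter(rules, url):
    """Return True (accept) or False (reject) for url.

    rules is a list of lines in regex-urlfilter.txt style: '+regex' or '-regex'.
    The first rule whose regex is found anywhere in the URL decides;
    if no rule matches, the URL is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False

URL = "https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G"
ACCEPT = r"+^https?://nseindia\.com/live_market/dynaContent/live_analysis/top_gainers_losers\.htm\?cat=([GL])"
SKIP_QUERIES = "-[?*!@=]"

# Skip rule listed first: it matches the '?' in the URL, so the URL is rejected.
print(first_match_filter([SKIP_QUERIES, ACCEPT, "-."], URL))   # False
# Accept rule listed first: it matches before the skip rule is consulted.
print(first_match_filter([ACCEPT, SKIP_QUERIES, "-."], URL))   # True
```

Under these assumptions, either removing the skip rule or placing the accept pattern above it lets the URL through.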

Thanks & Regards
Venkata MR
+91 98455 77125



Re: URL filter rejecting the URLs

Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi,

the pattern should work. Of course, you need to make sure that
- no earlier pattern in regex-urlfilter.txt causes the URL to be rejected
- no other active URL filter rejects the URL
- the folder of the regex-urlfilter.txt you're editing
  is first on the class path. Usually, $NUTCH_HOME/conf/regex-urlfilter.txt is used
- (optionally) you may simplify the regex: the characters /_= have no special meaning
  in a regex and do not need to be escaped by \
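
The last point can be checked quickly outside Nutch. The following is a small sketch that assumes Python's re module handles these escapes the same way as the Java regex engine: in both, a backslash before /, _ or = simply yields the literal character. The names ESCAPED, SIMPLIFIED and compare are made up for this illustration.

```python
import re

# Pattern as originally posted (without the leading '+'), with redundant escapes.
ESCAPED = r"^https?://nseindia\.com\/live\_market\/dynaContent\/live\_analysis\/top\_gainers\_losers\.htm\?cat\=([GL])"
# Simplified pattern: only '.' and '?' keep their backslash.
SIMPLIFIED = r"^https?://nseindia\.com/live_market/dynaContent/live_analysis/top_gainers_losers\.htm\?cat=([GL])"

BASE = "https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat="

def compare(url):
    """Return (escaped_matches, simplified_matches) for one URL."""
    return (bool(re.search(ESCAPED, url)), bool(re.search(SIMPLIFIED, url)))

# Both patterns agree: G and L match, X does not.
for cat in ("G", "L", "X"):
    print(cat, compare(BASE + cat))
```

Both forms behave identically here, so the simplified one is only easier to read, not different in effect.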

The easiest way to test it (Nutch 1.15):
% cat $NUTCH_HOME/conf/regex-urlfilter.txt
+^https?://nseindia\.com/live_market/dynaContent/live_analysis/top_gainers_losers\.htm\?cat=([GL])
-.
% echo "https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G)" \
   | nutch filterchecker -filterName urlfilter-regex -stdin
Checking combination of these URLFilters: RegexURLFilter
+https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G)


And with another "forbidden" URL:
% echo "https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=X)" \
  | nutch filterchecker -filterName urlfilter-regex -stdin
Checking combination of these URLFilters: RegexURLFilter
-https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=X)


Best,
Sebastian
