Posted to user@nutch.apache.org by "Krishnanand, Kartik" <kartik.krishnanand@bankofamerica.com> on 2014/09/12 13:03:48 UTC

Crawl URL with varying query parameters values

Hi, Nutch Gurus,

I need to crawl two dynamically generated pages


1.       http://example.com and

2.       http://example.com?request_locale=es_US

The difference is that when the query parameter "request_locale" equals "es_US", Spanish content is loaded. We would like to be able to crawl both URLs if possible. I have put both URLs in my seed.txt, but the logs show that only the first URL is being crawled, not the second.

I modified regex-normalize.xml so that it does not strip out query parameters; the modified file is given below. How do I configure Nutch to crawl both URLs?

Kartik

<regex-normalize>

<!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
<regex>
  <pattern>(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
  <substitution>$4</substitution>
</regex>

<!-- changes default pages into standard for /index.html, etc. into /
<regex>
  <pattern>/((?i)index|default)\.((?i)js[pf]{1}?[afx]?|cgi|cfm|asp[x]?|[psx]?htm[l]?|php[3456]?)(\?|&amp;|#|$)</pattern>
  <substitution>/$3</substitution>
</regex> -->

<!-- removes interpage href anchors such as site.com#location -->
<regex>
  <pattern>#.*?(\?|&amp;|$)</pattern>
  <substitution>$1</substitution>
</regex>

<!-- cleans ?&amp;var=value into ?var=value -->
<regex>
  <pattern>\?&amp;</pattern>
  <substitution>\?</substitution>
</regex>

<!-- cleans multiple sequential ampersands into a single ampersand -->
<regex>
  <pattern>&amp;{2,}</pattern>
  <substitution>&amp;</substitution>
</regex>

<!-- removes trailing ? -->
<regex>
  <pattern>[\?&amp;\.]$</pattern>
  <substitution></substitution>
</regex>

<!-- removes duplicate slashes -->
<regex>
  <pattern>(?&lt;!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>

</regex-normalize>
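Incidentally, none of the rules above touches a bare query string such as request_locale=es_US, which points at the URL filters rather than the normalizer. A quick standalone check in plain Python (the patterns are copied from the file above with the XML entities &amp; and &lt; unescaped, and $n rewritten as \n for re.sub; this approximates, but is not, Nutch's RegexURLNormalizer):

```python
import re

url = "http://example.com?request_locale=es_US"

# (pattern, substitution) pairs from the regex-normalize.xml above.
rules = [
    # session ids (jsessionid, PHPSESSID, ...)
    (r"(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&|#|$)", r"\4"),
    # interpage anchors such as site.com#location
    (r"#.*?(\?|&|$)", r"\1"),
    # ?&var=value -> ?var=value
    (r"\?&", "?"),
    # multiple sequential ampersands -> single ampersand
    (r"&{2,}", "&"),
    # trailing ?, & or .
    (r"[?&.]$", ""),
    # duplicate slashes, except the ones following a scheme's colon
    (r"(?<!:)/{2,}", "/"),
]

normalized = url
for pattern, substitution in rules:
    normalized = re.sub(pattern, substitution, normalized)

print(normalized)  # unchanged: http://example.com?request_locale=es_US
```

The query string survives every rule, so the normalizer is not what is dropping the second seed URL.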

----------------------------------------------------------------------
This message, and any attachments, is for the intended recipient(s) only, may contain information that is privileged, confidential and/or proprietary and subject to important terms and conditions available at http://www.bankofamerica.com/emaildisclaimer.   If you are not the intended recipient, please delete this message.

Re: Crawl URL with varying query parameters values

Posted by Nima Falaki <nfalaki@popsugar.com>.
Did you also modify regex-urlfilter.txt so that it does not skip URLs containing
certain characters as probable queries? Comment out (put a # in front of) the
line below in regex-urlfilter.txt:

# skip URLs containing certain characters as probable queries, etc.

#-[?*!@=]
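For reference, the relevant part of the default regex-urlfilter.txt looks roughly like this once the skip rule is commented out (exact surroundings may differ between Nutch versions):

```
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]

# accept anything else
+.
```

Note that commenting the rule out lets every query-string URL through. Since the rules are applied in order and the first match wins, a narrower alternative is to add a specific accept rule above the skip rule, e.g. a hypothetical `+^http://example\.com/?\?request_locale=` for just this site.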

On Fri, Sep 12, 2014 at 4:03 AM, Krishnanand, Kartik <
kartik.krishnanand@bankofamerica.com> wrote:

> [...]



-- 



Nima Falaki
Software Engineer
nfalaki@popsugar.com

RE: Crawl URL with varying query parameters values

Posted by Markus Jelsma <ma...@openindex.io>.
Hi - you probably have URL filtering enabled, the regex filter specifically. By
default it filters out URLs with query strings. Check your URL filters.
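To illustrate: the default regex-urlfilter.txt ships a rule `-[?*!@=]`, which rejects any URL containing one of those characters, and that is exactly what drops the second seed. A minimal Python sketch of that single rule (an approximation of the semantics, not Nutch's actual RegexURLFilter):

```python
import re

# The default "-" rule: reject any URL containing ? * ! @ or =.
skip_queries = re.compile(r"[?*!@=]")

def passes_filter(url):
    # A "-" rule rejects on match; otherwise the URL falls through
    # to the catch-all "+." accept rule at the end of the file.
    return skip_queries.search(url) is None

print(passes_filter("http://example.com"))                       # True
print(passes_filter("http://example.com?request_locale=es_US"))  # False
```

The second URL is rejected because it contains both `?` and `=`, so it never reaches the fetcher regardless of what the normalizer does.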

Markus

-----Original message-----
> From: Krishnanand, Kartik <kartik.krishnanand@bankofamerica.com>
> Sent: Friday 12th September 2014 13:04
> To: user@nutch.apache.org
> Subject: Crawl URL with varying query parameters values
> 
> [...]