Posted to user@nutch.apache.org by Kumar Krishnasami <ku...@vembu.com> on 2010/01/08 14:01:31 UTC

Enabling Query Strings in *filter.txt files

Hi All,

I have some URLs with query strings in them that need to be crawled.
I've commented out the appropriate line in crawl_urlfilter.txt and
regex-urlfilter.txt to enable crawling of URLs that contain a '?'.
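
For readers unfamiliar with these filter files, here is a rough Python
sketch of how such a rule list is evaluated. It is not Nutch code. It
assumes the file contains the commonly shipped rule `-[?*!@=]` followed
by a catch-all `+.`, and that the first matching rule decides the
outcome. The names load_rules and accepts are made up for illustration.

```python
import re

def load_rules(text):
    """Parse filter lines into (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """First matching rule wins; a URL that matches no rule is rejected."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False

# Assumed default rules (an assumption, not copied from the poster's files).
DEFAULT_RULES = """
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else
+.
"""

# The same rules with the query-string rule commented out.
EDITED_RULES = DEFAULT_RULES.replace("-[?*!@=]", "# -[?*!@=]")

url = "http://queue.acm.org/detail.cfm?id=988409"
print(accepts(load_rules(DEFAULT_RULES), url))  # False
print(accepts(load_rules(EDITED_RULES), url))   # True
```

With the rule commented out, both URLs mentioned in this post pass the
filter in this sketch, which suggests the filter alone would not explain
one of them going missing.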

If I crawl URLs like http://queue.acm.org/detail.cfm?id=988409,
everything is fine.

However, when I crawl a URL like
http://search.techcrunch.com/query.php?s=local, the crawl completes
without any error. But when I look into the indexes, there is nothing from
this particular URL. If I search for some terms from this URL, I get 0 hits.

Is it because these sites may be blocking Nutch from crawling, or am I
missing something related to query strings?

Also, I do not see any error messages in the console while the crawl is
running. It seems like everything is running fine, but then the index
doesn't get created.

Thanks,
Kumar.

Re: Enabling Query Strings in *filter.txt files

Posted by Kumar Krishnasami <ku...@vembu.com>.

Thanks again, Mischa. I checked the robots.txt file for 
search.techcrunch.com and it disallows all crawlers.
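
A check like this can be scripted with Python's standard
urllib.robotparser module. The sketch below is only an illustration: the
robots.txt text is a hypothetical "disallow everything" file rather than
a copy of the real one, and the agent name "MyNutchAgent" is a
placeholder for whatever agent name the crawler is configured with.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, agent, url):
    """Return True if robots_txt permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# Hypothetical robots.txt content that disallows every crawler.
DISALLOW_ALL = """User-agent: *
Disallow: /
"""

url = "http://search.techcrunch.com/query.php?s=local"
print(is_allowed(DISALLOW_ALL, "MyNutchAgent", url))  # False
```

To test against a live site instead, RobotFileParser also offers
set_url() and read(), which download the robots.txt file before
can_fetch() is called.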

Re: Enabling Query Strings in *filter.txt files

Posted by Mischa Tuffield <mi...@garlik.com>.

Hi Kumar, 

You could try using curl and sending the accept headers your Nutch installation exposes. These are set in conf/nutch-site.xml. This would at least help you rule out the idea that TechCrunch is blocking your instance of Nutch.
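
For anyone who prefers a script to curl, roughly the same test can be
sketched with Python's standard library. The agent string and Accept
value below are placeholders; substitute whatever your own
conf/nutch-site.xml specifies. The helper names build_request and
fetch_status are made up for this example.

```python
from urllib.request import Request, urlopen

def build_request(url, agent, accept):
    """Build a GET request carrying crawler-like headers."""
    return Request(url, headers={"User-Agent": agent, "Accept": accept})

def fetch_status(request, timeout=10):
    """Send the request; return the HTTP status and the first bytes of the body.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, which is
    itself a useful signal that the site is refusing the request.
    """
    with urlopen(request, timeout=timeout) as response:
        return response.status, response.read(200)

# Placeholder values (assumptions), not taken from any real configuration.
request = build_request(
    "http://search.techcrunch.com/query.php?s=local",
    agent="MyNutchAgent/1.0",
    accept="text/html",
)
print(request.header_items())

# Uncomment to actually send the request over the network:
# print(fetch_status(request))
```

If the response differs from what a browser gets for the same URL, the
site is probably treating the crawler's headers differently.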

Mischa

___________________________________
Mischa Tuffield
Email: mischa.tuffield@garlik.com
Homepage - http://mmt.me.uk/
Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
+44(0)20 8973 2465  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD