Posted to user@nutch.apache.org by Kumar Krishnasami <ku...@vembu.com> on 2010/01/08 14:01:31 UTC
Enabling Query Strings in *filter.txt files
Hi All,
I have some URLs with query strings that need to be crawled. I've
commented out the relevant line in crawl-urlfilter.txt and
regex-urlfilter.txt to enable crawling of URLs that contain a '?'.
If I crawl URLs like: http://queue.acm.org/detail.cfm?id=988409
everything is fine.
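A minimal sketch of why commenting out that line matters, assuming the stock filter files still contain the usual `-[?*!@=]` rule followed by a catch-all `+.` rule. The rule lists and the `accepts` helper below are illustrative and only imitate the first-match-wins evaluation of the regex URL filter in Python; they are not Nutch code.

```python
import re

# Rules are (sign, pattern) pairs checked top to bottom; the first pattern
# found anywhere in the URL decides. Both lists are assumptions for this sketch.
DEFAULT_RULES = [
    ("-", r"[?*!@=]"),  # skip URLs containing probable query characters
    ("+", r"."),        # accept everything else
]
EDITED_RULES = [
    # the "-[?*!@=]" line has been commented out
    ("+", r"."),
]

def accepts(url, rules):
    """Return True if the first matching rule is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # a URL that matches no rule is rejected

url = "http://queue.acm.org/detail.cfm?id=988409"
print(accepts(url, DEFAULT_RULES))
print(accepts(url, EDITED_RULES))
```

With the default rules the URL is rejected because of the '?' and '='; with the edited rules it passes the filter, which matches the behaviour described for the queue.acm.org URL.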
However, when I crawl a URL like:
http://search.techcrunch.com/query.php?s=local the crawl completes
without any error. But if I look into the indexes, there is nothing from
this particular URL. If I search for some terms from this URL, I get 0 hits.
Is it because these sites may be blocking Nutch from crawling, or am I
missing something related to query strings?
Also, I do not see any error messages in the console while the crawl is
running. It seems like everything is running fine, but then the index
doesn't get created.
Thanks,
Kumar.
Re: Enabling Query Strings in *filter.txt files
Posted by Kumar Krishnasami <ku...@vembu.com>.
Thanks again, Mischa. I checked the robots.txt file for
search.techcrunch.com and it disallows all crawlers.
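The same kind of robots.txt check can be sketched with Python's standard `urllib.robotparser`. The two robots.txt bodies and the agent name `MyNutchBot` below are hypothetical stand-ins, not copies of the real files or of any real configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt bodies used only for illustration.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

def is_allowed(robots_txt, agent, url):
    """Parse a robots.txt body and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

blocked = is_allowed(BLOCK_ALL, "MyNutchBot",
                     "http://search.techcrunch.com/query.php?s=local")
open_site = is_allowed(ALLOW_ALL, "MyNutchBot",
                       "http://queue.acm.org/detail.cfm?id=988409")
print(blocked, open_site)
```

A crawler that honours robots.txt skips a disallowed URL without treating it as an error, which would be consistent with a crawl that finishes cleanly but indexes nothing from that site.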
Re: Enabling Query Strings in *filter.txt files
Posted by Mischa Tuffield <mi...@garlik.com>.
Hi Kumar,
You could try using curl to send the same accept headers your Nutch installation sends. These are set in conf/nutch-site.xml. This would at least help you rule out the idea that TechCrunch is blocking your instance of Nutch.
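A rough sketch of that check in Python rather than curl, assuming the agent name and accept-language values live in `http.agent.name` and `http.accept.language` properties. The XML snippet and its values are placeholders, and the real User-Agent string Nutch sends may combine more fields than the name alone.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Placeholder configuration in the shape of conf/nutch-site.xml.
NUTCH_SITE_XML = """<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchBot</value>
  </property>
  <property>
    <name>http.accept.language</name>
    <value>en-us,en;q=0.7</value>
  </property>
</configuration>
"""

def read_properties(xml_text):
    """Collect <property> name/value pairs into a dict."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.iter("property")}

def build_request(url, props):
    """Build a request carrying headers derived from the configuration."""
    headers = {"User-Agent": props.get("http.agent.name", "")}
    language = props.get("http.accept.language")
    if language:
        headers["Accept-Language"] = language
    return urllib.request.Request(url, headers=headers)

props = read_properties(NUTCH_SITE_XML)
request = build_request("http://search.techcrunch.com/query.php?s=local", props)
# To send it for real: urllib.request.urlopen(request, timeout=10)
```

Comparing the response to this request with one sent under a browser-like User-Agent would show whether the site treats the crawler's identity differently.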
Mischa
___________________________________
Mischa Tuffield
Email: mischa.tuffield@garlik.com
Homepage - http://mmt.me.uk/
Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
+44(0)20 8973 2465 http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD