Posted to dev@hc.apache.org by Vincent Chain <vc...@yahoo.com> on 2004/07/11 09:15:08 UTC

Cannot use HttpClient to search google

I have run into an interesting issue. It is probably not in HttpClient itself, but I haven't figured out how to make it work yet. I am using version 2.0.
 
What I am trying to do is simple enough: use HttpClient to simulate a typical browser request that searches Google, for example a query like
 
http://www.google.com/search?hl=en&ie=UTF-8&q=sql+server+trace
 
I used code like this:
 
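import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

HttpClient client = new HttpClient();
GetMethod m = new GetMethod("http://www.google.com/search?hl=en&ie=UTF-8&q=sql+server+trace");
int s = client.executeMethod(m);  // s is the HTTP status code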
 
s is always 403 for me (and this 403 should have nothing to do with a proxy). The response body is basically Google saying the request is forbidden, which I assume is because the request reached a host the client is not supposed to use. The full response is too large for this email, but it looks like this:
 
<html><head><title>403 Forbidden</title>....
<blockquote><H1>Forbidden</H1>Your client does not have permission to get URL <code>/search?hl=en&amp;ie=UTF-8&amp;q=sql+server+trace</code> from this server.  (Client IP address: xx.xx.xx.xx)<br><br>Please see Google's Terms of Service posted at http://www.google.com/terms_of_service.html
....
 
I guess the main reason is that Google uses Akamai's network to distribute load. When I run nslookup for Google on my server, the DNS records returned have very short lifetimes, from several seconds to a couple of minutes. I guess this forces the browser to issue another DNS query the next time I search. The issue with HttpClient, however, is that it always uses the same IP for www.google.com and seems to ignore the short lifetime of the DNS entry. I think that because HttpClient opened a socket to this 'old' IP address, Google somehow decided the request was not valid and rejected it.
 
I took a quick look at the HttpClient code, and it seems the socket it uses to open the connection is a plain java.net.Socket (DefaultProtocolSocketFactory::createSocket), so I guess HttpClient is not directly responsible for the problem. Nevertheless, has anyone else run into a similar issue? Some of the HttpClient sample code uses http://www.google.com, so shouldn't it have the same problem?
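
For anyone wanting to check the DNS-caching guess above: name lookups made through java.net are cached by the JVM, and the lifetime of successful lookups is governed by the networkaddress.cache.ttl security property. The following is only a minimal sketch for inspecting that setting. The class name DnsCacheTtl and the helper describeTtl are made up for illustration and are not part of HttpClient, and the sketch says nothing about whether DNS caching is actually related to the 403.

```java
import java.security.Security;

public class DnsCacheTtl {
    // Hypothetical helper: turn the raw property value into a readable description.
    static String describeTtl(String raw) {
        if (raw == null) {
            // Property not set: the JVM falls back to its own default policy.
            return "not set (JVM default applies)";
        }
        int ttl = Integer.parseInt(raw.trim());
        if (ttl < 0) {
            return "cache forever";
        }
        if (ttl == 0) {
            return "do not cache";
        }
        return "cache for " + ttl + " seconds";
    }

    public static void main(String[] args) {
        // Read the security property that controls caching of successful lookups.
        String raw = Security.getProperty("networkaddress.cache.ttl");
        System.out.println("networkaddress.cache.ttl: " + describeTtl(raw));
    }
}
```

If the property is not set, the effective default depends on the JVM version and on whether a security manager is installed.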
 
Thanks a lot for any tips.
 
 
 
 

		

Re: Cannot use HttpClient to search google

Posted by Adrian Sutton <ad...@intencha.com>.
> <html><head><title>403 Forbidden</title>....
> <blockquote><H1>Forbidden</H1>Your client does not have permission to 
> get URL <code>/search?hl=en&amp;ie=UTF-8&amp;q=sql+server+trace</code> 
> from this server.  (Client IP address: xx.xx.xx.xx)<br><br>Please see 
> Google's Terms of Service posted at 
> http://www.google.com/terms_of_service.html
> ....
>
> I guess the main reason is google uses akamai's network to distribute 
> loads.

No, it means you should read the terms of service, specifically the part 
about not using "screen-scraping" techniques to programmatically perform 
searches (which is what you're trying to do).  You should use the 
Google SOAP search service instead, as it will make your life a lot 
easier.

It would not be appropriate to discuss ways around the technical 
limitations Google uses to enforce their terms of service on an Apache 
Software Foundation mailing list.

The particular section of the Google ToS that I believe applies here is 
listed under the "Personal Use Only" and "No Automated Querying" 
headings.

Information on Google's SOAP APIs is available at 
http://www.google.com.au/apis/ (note they also have terms of service)

Finally, sorry if this seems abrupt, but it is important for the ASF to 
clearly not support use of its products in any way that may cause 
legal trouble.  If you feel you are following Google's terms of service 
and I've missed something, then my apologies.

Regards,

Adrian Sutton.

----------------------------------------------
Intencha "tomorrow's technology today"
Ph: 38478913 0422236329
Suite 8/29 Oatland Crescent
Holland Park West 4121
Australia QLD
www.intencha.com