Posted to common-user@hadoop.apache.org by edward choi <mp...@gmail.com> on 2010/12/09 12:05:18 UTC

Question from a Desperate Java Newbie

Excuse me for asking a general Java question here.
I tried to find a Java mailing list through Google, but none of them were active.

There is a problem that's been driving me crazy for a while.

I am trying to download web pages from the New York Times.
With Java's URL.openStream(), I can't get past the login requirement.
But with C++ socket programming (using read() and write()), I can download
any web page just fine.

The interesting thing is that with C++, I get redirected about 10 times. Below
are the headers of the first redirect response when I try to
download
"
http://www.nytimes.com/glogin?URI=http://www.nytimes.com:80/2010/12/09/world/asia/09military.html&OQ=_rQ3D1Q26hp&OP=47c049a1Q2FVQ3EY6VQ5Dks9akk5Q27VQ27Q2AFQ2AVFQ27VQ2AtVQ3EkahQ5DVQ3F9Q5CQ3FVQ2AtbQ5ChQ5C5Q3Fa!Q2BN5bh
"

HTTP/1.1 302 Moved Temporarily
Server: Sun-ONE-Web-Server/6.1
Date: Thu, 09 Dec 2010 08:42:35 GMT
Content-type: text/html
Set-cookie: RMID=0b5d4aea392d4d00967bfaf1; expires=Friday, 09-Dec-2011 08:42:35 GMT; path=/; domain=.nytimes.com
Set-cookie: NYT_GR=4d009b2b-yJ4V047ooAmPtGcvASTmng; path=/; domain=.nytimes.com
Set-cookie: NYT-S=0Mzh9PJwQ663rDXrmvxADeHJOGvJvXmRaJdeFz9JchiAJK89nlVaR7bsV.Ynx4rkFI; expires=Saturday, 08-Jan-2011 08:42:35 GMT; path=/; domain=.nytimes.com
Set-cookie: NYT-Pref=hppznw|^creator|NYTD.Cookies; path=/; domain=.nytimes.com
Location: http://www.nytimes.com:80/2010/12/09/world/asia/09military.html?_r=1&hp
Expires: Thu, 01 Dec 1994 16:00:00 GMT
Cache-control: no-cache
Pragma: no-cache
Connection: close

But with Java, I get redirected only once, to an https:// page, and it's a
dead end. Below is the result of java.net.URLConnection.getHeaderFields():

HTTP/1.1 301 Moved Permanently,
Date: Thu, 09 Dec 2010 10:50:53 GMT,
Content-type: text/html,
Content-length: 0,
Location: https://myaccount.nytimes.com/auth/login?URI=/2010/12/09/world/asia/09military.html&OQ=_rQ3D5Q26hp&REFUSE_COOKIE_ERROR=SHOW_ERROR,
Server: Sun-ONE-Web-Server/6.1,

There is a clear difference between the two. I don't know why, and it's been
driving me crazy.
My guess is that the C++ write() function can create some kind of cookie by
itself, but Java's URL.openStream() can't.

Am I right? Or can anyone explain this for me?
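
A note on the cookie guess above: write() on a socket only sends the bytes
it is given, and URL.openStream() does not remember cookies unless a
CookieHandler has been installed. Cookies come from the server in Set-cookie
headers like the ones shown, and it is up to the client to send them back in
a Cookie header on the next request. The following is a minimal sketch of
that one step, not the code used in this thread. The class and method names
(CookieEcho, toCookieHeader) are made up for illustration, and the sketch
ignores expiry, path and domain matching.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CookieEcho {
    // Build the value of a "Cookie:" request header from raw "Set-Cookie:" values.
    // Only the leading name=value pair of each value is sent back; attributes such
    // as expires, path and domain are instructions for the client, not payload.
    static String toCookieHeader(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<>();
        for (String value : setCookieValues) {
            int semi = value.indexOf(';');
            String pair = (semi >= 0 ? value.substring(0, semi) : value).trim();
            if (!pair.isEmpty()) {
                pairs.add(pair);
            }
        }
        return String.join("; ", pairs);
    }

    public static void main(String[] args) {
        // Two of the Set-cookie values from the 302 response quoted above.
        List<String> fromServer = Arrays.asList(
            "RMID=0b5d4aea392d4d00967bfaf1; expires=Friday, 09-Dec-2011 08:42:35 GMT; path=/; domain=.nytimes.com",
            "NYT_GR=4d009b2b-yJ4V047ooAmPtGcvASTmng; path=/; domain=.nytimes.com");
        System.out.println("Cookie: " + toCookieHeader(fromServer));
    }
}
```

Run as is, this prints
"Cookie: RMID=0b5d4aea392d4d00967bfaf1; NYT_GR=4d009b2b-yJ4V047ooAmPtGcvASTmng".
A client that never sends such a header looks to the server like a browser
with cookies turned off, which may be what the REFUSE_COOKIE_ERROR parameter
in the Java redirect refers to.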

Re: Question from a Desperate Java Newbie

Posted by edward choi <mp...@gmail.com>.
Are you talking about java.net.HttpURLConnection?
If so, I've already tried using it with the getInputStream() function, but
still no luck.

I actually got an interesting answer from Aardvark, which said that
the NY Times has a policy called "Read once for free".
So apparently I crawled with the C++ application first and used up my chance
to crawl with the Java application.
The answerer was not sure about this policy, but I think it makes sense,
because today I tried the Java crawler first and it worked just fine.

Ed

2010/12/10 Hemanth Yamijala <yh...@gmail.com>

> Not exactly what you may want - but could you try using a HTTP client
> in Java ? Some of them have the ability to automatically follow
> redirects, manage cookies etc.
>
> Thanks
> hemanth

Re: Question from a Desperate Java Newbie

Posted by edward choi <mp...@gmail.com>.
I totally obey the robots.txt since I am only fetching RSS feeds :-)
I implemented my crawler with HttpClient and it is working fine.
I often get messages about "Cookie rejected", but am able to fetch news
articles anyway.

I guess the default "java.net" client is the stateful client you mentioned.
Thanks for the tip!!

Ed

On Dec 16, 2010, at 2:18 AM, Steve Loughran <st...@apache.org> wrote:

> On 10/12/10 09:08, Edward Choi wrote:
> > I was wrong. It wasn't because of the "read once free" policy. I tried
> > again with Java first again and this time it didn't work.
> > I looked up google and found the Http Client you mentioned. It is the one
> > provided by apache, right? I guess I will have to try that one now. Thanks!
> >
>
> httpclient is good, HtmlUnit has a very good client that can simulate
> things like a full web browser with cookies, but that may be overkill.
>
> NYT's read once policy uses cookies to verify that you are there for the
> first day not logged in, for later days you get 302'd unless you delete
> the cookie, so stateful clients are bad.
>
> What you may have been hit by is whatever robot trap they have -if you
> generate too much load and don't follow the robots.txt rules they may
> detect this and push back
>
>

Re: Question from a Desperate Java Newbie

Posted by Steve Loughran <st...@apache.org>.
On 10/12/10 09:08, Edward Choi wrote:
> I was wrong. It wasn't because of the "read once free" policy. I tried again with Java first again and this time it didn't work.
> I looked up google and found the Http Client you mentioned. It is the one provided by apache, right? I guess I will have to try that one now. Thanks!
> 

httpclient is good. HtmlUnit has a very good client that can simulate
things like a full web browser with cookies, but that may be overkill.

NYT's read-once policy uses cookies to verify that you are there for the
first day, not logged in. On later days you get 302'd unless you delete
the cookie, so stateful clients are bad.

What you may have been hit by is whatever robot trap they have: if you
generate too much load and don't follow the robots.txt rules, they may
detect this and push back.
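
To make the robots.txt point concrete, here is a simplified sketch of
checking a path against the rules before fetching it. It is an illustration
under assumptions, not a complete parser: it only reads Disallow lines in
groups addressed to every agent ("User-agent: *"), treats each rule as a
plain path prefix, and ignores Allow lines, wildcards and Crawl-delay. The
class and method names and the sample rules are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class RobotsCheck {
    // Collect Disallow prefixes from groups that apply to all agents.
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inStarGroup = false;   // are we inside a "User-agent: *" group?
        boolean lastWasAgent = false;  // consecutive User-agent lines share a group
        for (String raw : robotsTxt.split("\\r?\\n")) {
            int hash = raw.indexOf('#');                       // strip comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;                                      // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!lastWasAgent) {
                    inStarGroup = false;                       // a new group starts here
                }
                if (value.equals("*")) {
                    inStarGroup = true;
                }
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                if (field.equals("disallow") && inStarGroup && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    // A path is allowed when no collected Disallow prefix matches its start.
    static boolean isAllowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String sample = "User-agent: *\nDisallow: /private/\n";  // made-up rules
        List<String> rules = disallowedForAll(sample);
        System.out.println(isAllowed("/feeds/rss.xml", rules));
        System.out.println(isAllowed("/private/page.html", rules));
    }
}
```

A crawler would fetch /robots.txt once per host, run a check like this before
each request, and pause between requests to keep the load low. On the cookie
side, plain java.net with no CookieHandler installed keeps no cookies between
requests, so it is not stateful unless one is added.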


Re: Question from a Desperate Java Newbie

Posted by Edward Choi <mp...@gmail.com>.
I was wrong. It wasn't because of the "read once free" policy. I tried Java first again, and this time it didn't work.
I searched Google and found the HTTP client you mentioned. It is the one provided by Apache, right? I guess I will have to try that one now. Thanks!

From mp2893's iPhone

On Dec 10, 2010, at 11:33 AM, Hemanth Yamijala <yh...@gmail.com> wrote:

> Not exactly what you may want - but could you try using a HTTP client
> in Java ? Some of them have the ability to automatically follow
> redirects, manage cookies etc.
> 
> Thanks
> hemanth

Re: Question from a Desperate Java Newbie

Posted by edward choi <mp...@gmail.com>.
I would, but I am trying to integrate the crawler with Hadoop, so I wanted
to write in Java :-)

2010/12/10 Santosh Borse <sa...@persistent.co.in>

> You can use open source wget as well.

RE: Question from a Desperate Java Newbie

Posted by Santosh Borse <sa...@persistent.co.in>.
You can use open source wget as well.


Re: Question from a Desperate Java Newbie

Posted by Hemanth Yamijala <yh...@gmail.com>.
Not exactly what you may want - but could you try using a HTTP client
in Java ? Some of them have the ability to automatically follow
redirects, manage cookies etc.

Thanks
hemanth
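
As a rough illustration of the suggestion above, the JDK itself can do both
jobs without a third-party library: java.net.CookieManager stores and
replays cookies, and redirects can be followed by hand. One reason to follow
them by hand is that HttpURLConnection does not automatically follow a
redirect that switches protocol (http to https). This is a sketch under
assumptions, not the approach the posters ended up using (that was Apache
HttpClient). The class and method names and the example.com URL are
hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class CookieAwareFetch {
    // Offline demonstration: store one Set-Cookie value for a URL, then ask the
    // jar which Cookie values it would send back to that same URL.
    static List<String> storeAndReplay(String url, String setCookie) {
        try {
            CookieManager jar = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
            URI uri = URI.create(url);
            jar.put(uri, Collections.singletonMap(
                    "Set-Cookie", Collections.singletonList(setCookie)));
            Map<String, List<String>> headers =
                    jar.get(uri, Collections.<String, List<String>>emptyMap());
            return headers.get("Cookie");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Follow redirects by hand, including http -> https hops, up to maxHops.
    // Installing the CookieManager is a JVM-wide setting: every later
    // HttpURLConnection in this process will store and send cookies through it.
    static HttpURLConnection open(String start, int maxHops) throws IOException {
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
        URL url = new URL(start);
        for (int hop = 0; hop <= maxHops; hop++) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setInstanceFollowRedirects(false);   // we handle 3xx ourselves
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            if (status < 300 || status >= 400 || location == null) {
                return conn;                          // final response; caller reads it
            }
            conn.disconnect();
            url = new URL(url, location);             // resolves relative Location values
        }
        throw new IOException("Too many redirects from " + start);
    }

    public static void main(String[] args) {
        // No network needed for this part.
        System.out.println(
            storeAndReplay("http://www.example.com/index.html", "session=abc123; path=/"));
    }
}
```

A caller would use open(someUrl, 10).getInputStream() to read the final page.
The hop limit guards against redirect loops.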
