Posted to user@tika.apache.org by lhpangler <gf...@nc.rr.com> on 2009/09/30 18:40:53 UTC

Response Code 403

I have a Java program that passes a list of URLs to Tika to fetch a set of
HTML pages.  Most of the time, I can process the list of URLs with Tika.
However, sometimes I get the following error:

Error: ERROR: Received the following error from Tika: Server returned HTTP
response code: 403 for URL: http://aurl.atsomeplace.com.

I can enter the URL in a browser and the page appears.

Any ideas why this is happening?
Is there any way to debug it?

TIA
 
-- 
View this message in context: http://n2.nabble.com/Response-Code-403-tp3743851p3743851.html
Sent from the Apache Tika - Users mailing list archive at Nabble.com.

Re: Response Code 403

Posted by lhpangler <gf...@nc.rr.com>.
Thank you very much for your responses.  The websites in question do not
require a username and password.  Also, it does not happen on an entire
website, but rather on some pages.

Again thanks for the ideas...
lhpangler



Re: Response Code 403

Posted by lhpangler <gf...@nc.rr.com>.
Can you change the user agent to look like a browser?



Re: Response Code 403

Posted by Cristian Vat <cr...@gmail.com>.
HTTP Response code 403 is Forbidden.

Most likely you are trying to access a URL that is password-protected or
protected by some other means.
Even for password-protected pages, your browser might have stored the
username and password, which is why it works there.

An alternative explanation is that Tika sends the default Java user agent to
the server and the website blocks it (treating it as a possible crawling
attempt).

It really depends on the URL since different websites may have different
rules for when they give or deny you access.
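
The user-agent explanation can be tested with a small sketch like the one
below. The wording of the error ("Server returned HTTP response code: 403 for
URL") matches what java.net.HttpURLConnection reports, so it assumes the fetch
goes through java.net and that the program is able to open the connection
itself and pass the resulting InputStream on for parsing. The class name
BrowserLikeFetch, the prepare() helper and the User-Agent string are made up
for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;

public class BrowserLikeFetch {

    // Hypothetical value; substitute whatever browser-style string you prefer.
    public static final String USER_AGENT =
            "Mozilla/5.0 (compatible; ExampleFetcher/1.0)";

    /** Prepares (but does not yet send) a request carrying the given User-Agent. */
    public static HttpURLConnection prepare(URL url, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Without this header the JDK identifies itself as "Java/<version>",
        // which some sites reject as a likely crawler.
        conn.setRequestProperty("User-Agent", userAgent);
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java BrowserLikeFetch <http-url>");
            return;
        }
        HttpURLConnection conn = prepare(URI.create(args[0]).toURL(), USER_AGENT);
        int status = conn.getResponseCode();
        System.out.println("HTTP " + status);
        if (status < 400) {
            try (InputStream in = conn.getInputStream()) {
                // Assumption: this stream is what gets handed to the parser,
                // instead of letting the parsing code open the URL itself.
                System.out.println("first byte: " + in.read());
            }
        }
        conn.disconnect();
    }
}
```

If a page that returned 403 before now returns 200, the server was most likely
filtering on the user agent. If it still returns 403, one of the other causes
is more probable.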

-
Cristian Vat


Re: Response Code 403

Posted by Claudio Martella <cl...@tis.bz.it>.
403 is the HTTP "Forbidden" (access denied) status. It can occur for many
reasons. In your case the most likely one is that the server doesn't
recognize the user agent. Or you may need user/password authentication.
Another possibility is that you're fetching that page too often. I'd go for
the user agent though.
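
As for debugging, one rough way to tell these causes apart is to request a
failing URL directly and look at the status code, a couple of response headers
and the error page. The sketch below uses only java.net. The class name
Diagnose403 and the hint() heuristic are invented for illustration and are not
part of Tika; servers are free to return 403 for reasons this does not cover.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class Diagnose403 {

    /** Rough hint only, checked in order: success, credentials, rate limit, plain 403. */
    public static String hint(int status, String wwwAuthenticate, String retryAfter) {
        if (status < 400) return "request accepted";
        if (status == 401 || wwwAuthenticate != null) return "credentials requested";
        if (status == 429 || retryAfter != null) return "possibly rate limited";
        if (status == 403) return "refused; compare against a browser-like User-Agent";
        return "other error";
    }

    /** Reads the error page, which often says why the request was refused. */
    public static String errorBody(HttpURLConnection conn) throws IOException {
        try (InputStream err = conn.getErrorStream()) {
            if (err == null) return "";   // no body was sent with the error
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            for (int n; (n = err.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java Diagnose403 <http-url>");
            return;
        }
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(args[0]).toURL().openConnection();
        // getResponseCode() reports 403 as a value; getInputStream() would throw.
        int status = conn.getResponseCode();
        String auth = conn.getHeaderField("WWW-Authenticate");
        String retry = conn.getHeaderField("Retry-After");
        System.out.println("HTTP " + status + " " + conn.getResponseMessage());
        System.out.println("hint: " + hint(status, auth, retry));
        if (status >= 400) {
            System.out.println(errorBody(conn));
        }
        conn.disconnect();
    }
}
```

Running it against one URL that works and one that fails, and comparing the
output, should narrow things down.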

-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Engineer

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
claudio.martella@tis.bz.it http://www.tis.bz.it
