Posted to user@nutch.apache.org by Rout Biswajit-B16078 <B1...@freescale.com> on 2008/09/15 14:37:07 UTC

Not able to crawl password protected pages using NUTCH 0.9

Hi,
I have successfully configured NUTCH 0.9. It is crawling a number of
sites, and searching the crawled content also works properly.
However, I now want to crawl password-protected pages using NUTCH.
Accessing those pages requires a valid user name and password, which I
have configured in my nutch-site.xml and httpclient-auth.xml.
However, the protected pages are not being crawled.
I have attached nutch-site.xml, httpclient-auth.xml and hadoop.log in
a Zip file for your reference. Kindly check and let me know what is
missing on my end.
CONFIGURATION:
nutch-2008-07-10_04-01-48.tar (downloaded from
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/, which contains
your patch for HttpAuthentication)
 
Windows XP
Cygwin
jdk1.6.0
 
Thanks in advance...
Please help....
 
Best regards,
Biswajit
 

Re: Not able to crawl password protected pages using NUTCH 0.9

Posted by Kunthar <ku...@gmail.com>.
Stop trolling me...



Rout Biswajit-B16078 wrote:
>
> Hi,
>
> I have successfully configured NUTCH 0.9, which is crawling number of 
> sites and after that searching is also happening properly.
>
> However, now I want to crawl password protected pages using NUTCH. In 
> order to access those pages I should have a valid user name and 
> password. I have configured the user name and password in my 
> nutch-site.xml and httpclient-auth.xml
>
> However it is not crawling.
>
> I have attached nutch-site.xml, httpclient-auth.xml and hadoop.log in 
> the Zip file for your reference. Kindly check and let me know what is 
> missing from my end.
>
> CONFIGURATION:
>
> nutch-2008-07-10_04-01-48.tar (downloaded from
> http://hudson.zones.apache.org/hudson/job/Nutch-trunk/,
> which contains your patch for HttpAuthentication)
>
>  
>
> Windows XP
>
> Cygwin
>
> jdk1.6.0
>
>  
>
> Thanks in advance…
>
> Please help....
>
>  
>
> Best regards,
>
> Biswajit
>
>  
>


Re: Not able to crawl password protected pages using NUTCH 0.9

Posted by biswajit_rout <bi...@lntinfotech.com>.
Hi Susam,

Please take a look at new.txt and suggest a solution. This time I
have crawled another site. I am able to crawl all the public pages, but
the password-protected pages are not being crawled.

Best regards,
Biswajit.


biswajit_rout wrote:
> 
> Hi,
> 
> There is nothing to crawl in the home page of
> http://10.222.18.113:8080/dao/.
> 
> So this time I have crawled another site. I have successfully crawled all
> the public pages but am not able to crawl the private pages.
> I have attached a log file (new.log). Can you please check and let me know
> what needs to be done on my end?
> 
> Best regards,
> Biswajit.
> 
> 
> Susam Pal wrote:
>> 
>> The log file shows only one fetching line:
>> 
>> 2008-09-16 20:46:15,321 INFO  fetcher.Fetcher - fetching
>> http://10.222.18.113:8080/dao/
>> 
>> This has been fetched successfully. There is no other page being
>> fetched. Have you set up Nutch properly so that it can fetch all the
>> pages you need? If it tries to fetch a page but fails due to
>> authentication, then it is a problem with authentication.
>> 
>> In this case, it is not even attempting to fetch those pages. So, the
>> problem lies elsewhere. You need to first find out why it is fetching
>> only one page and not others.
>> 
>> Regards,
>> Susam Pal
>> 
>> On Tue, Sep 16, 2008 at 5:24 PM, biswajit_rout
>> <bi...@lntinfotech.com> wrote:
>>>
>>> But still it is not crawling the password protected pages...
>>>
>>> Regards,
>>> Biswajit.
>>>
>>>
>>> Susam Pal wrote:
>>>>
>>>> The latest log shows that the page from the URL:
>>>> http://10.222.18.113:8080/dao/ has been fetched successfully.
>>>>
>>>> Regards,
>>>> Susam Pal
>>>>
>>>> On Tue, Sep 16, 2008 at 3:33 PM, biswajit_rout
>>>> <bi...@lntinfotech.com> wrote:
>>>>>
>>>>> Hi Susam,
>>>>>
>>>>> Please find the latest log file (latest.log), which shows a
>>>>> different error.
>>>>>
>>>>> 2008-09-16 20:46:16,102 DEBUG httpclient.Http - url:
>>>>> http://10.222.18.113:8080/robots.txt; status code: 404; bytes
>>>>> received: 985; Content-Length: 985
>>>>> 2008-09-16 20:46:16,384 DEBUG httpclient.Http - url:
>>>>> http://10.222.18.113:8080/dao/; status code: 200; bytes received:
>>>>> 1941; Content-Length: 1941
>>>>>
>>>>> Thanks in advance...
>>>>>
>>>>> Best regards,
>>>>> Biswajit.
>>>>>
>>>>>
>>>>> biswajit_rout wrote:
>>>>>>
>>>>>> Hi Susam,
>>>>>>
>>>>>> Thanks for your immediate response...
>>>>>> Herewith I am attaching the debug-enabled log file
>>>>>> (debugenabled_hadoop.log). Kindly go through the file and let me
>>>>>> know what is missing on my end...
>>>>>>
>>>>>> Best regards,
>>>>>> Biswajit.
>>>>>>
>>>>>>
>>>>>> Susam Pal wrote:
>>>>>>>
>>>>>>> Hi Biswajit,
>>>>>>>
>>>>>>> The authscope specifies which IP address or domain name the
>>>>>>> credentials are used for. If you provide 10.222.18.113 in the
>>>>>>> authscope, the credentials will not be used for localhost, even
>>>>>>> though both refer to the same machine.
>>>>>>>
>>>>>>> Please provide logs with DEBUG enabled.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Susam Pal
>>>>>>>
>>>>>>> On Tue, Sep 16, 2008 at 1:33 PM, biswajit_rout
>>>>>>> <bi...@lntinfotech.com> wrote:
>>>>>>>>
>>>>>>>> Hi Susam,
>>>>>>>>
>>>>>>>> The IP 10.222.18.113 is simply the IP address of my machine
>>>>>>>> (localhost). I have now changed http://localhost:8080/ to
>>>>>>>> http://10.222.18.113:8080, but with no result: I am still not
>>>>>>>> able to crawl the password-protected pages.
>>>>>>>>
>>>>>>>> Kindly assist me to resolve this issue.
>>>>>>>>
>>>>>>>> Thanks in advance...
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Biswajit.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Susam Pal wrote:
>>>>>>>>>
>>>>>>>>> The logs show that it is fetching http://localhost:8080/ but you
>>>>>>>>> have set credentials for 10.222.18.113:8080, which is never being
>>>>>>>>> fetched. So, no authentication takes place.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Susam Pal
>>>>>>>>>
>>>>>>>>> On Mon, Sep 15, 2008 at 1:20 PM, biswajit_rout
>>>>>>>>> <bi...@lntinfotech.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Susam,
>>>>>>>>>>
>>>>>>>>>> In order to crawl password-protected pages, I am using
>>>>>>>>>> nutch-2008-07-10_04-01-48.tar (downloaded from
>>>>>>>>>> http://hudson.zones.apache.org/hudson/job/Nutch-trunk/, which
>>>>>>>>>> contains your patch for HttpAuthentication).
>>>>>>>>>>
>>>>>>>>>> I have modified nutch-site.xml and httpclient-auth.xml.
>>>>>>>>>>
>>>>>>>>>> Please find the attached zip file, which contains
>>>>>>>>>> nutch-site.xml and httpclient-auth.xml.
>>>>>>>>>>
>>>>>>>>>> Kindly provide me a solution for this.
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>> Biswajit
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Susam Pal wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Biswajit,
>>>>>>>>>>>
>>>>>>>>>>> Could you please tell us how you have added the support for
>>>>>>>>>>> authentication in Nutch 0.9? Nutch 0.9 cannot do authentication
>>>>>>>>>>> properly by default. The authentication feature is buggy in
>>>>>>>>>>> Nutch 0.9; this was fixed with this ticket:
>>>>>>>>>>> https://issues.apache.org/jira/browse/NUTCH-559
>>>>>>>>>>>
>>>>>>>>>>> The feature is documented here:
>>>>>>>>>>> http://wiki.apache.org/nutch/HttpAuthenticationSchemes
>>>>>>>>>>>
>>>>>>>>>>> The easiest way to use it is to check out the latest version of
>>>>>>>>>>> Nutch and build it, as it contains the authentication feature.
>>>>>>>>>>> If you want to use it with Nutch 0.9, you have to download the
>>>>>>>>>>> latest patch present in the ticket page, apply it to the source
>>>>>>>>>>> code and build it. You might have to resolve some conflicts
>>>>>>>>>>> manually.
>>>>>>>>>>>
>>>>>>>>>>> I would suggest that you do not send the same mail multiple
>>>>>>>>>>> times. We have received the same mail from you 4 times. It takes
>>>>>>>>>>> some time for members to reply to a mail. :-)
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Susam Pal
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 15, 2008 at 6:07 PM, Rout Biswajit-B16078
>>>>>>>>>>> <B1...@freescale.com> wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I have successfully configured NUTCH 0.9. It is crawling a
>>>>>>>>>>>> number of sites, and searching the crawled content also works
>>>>>>>>>>>> properly.
>>>>>>>>>>>>
>>>>>>>>>>>> However, I now want to crawl password-protected pages using
>>>>>>>>>>>> NUTCH. Accessing those pages requires a valid user name and
>>>>>>>>>>>> password, which I have configured in my nutch-site.xml and
>>>>>>>>>>>> httpclient-auth.xml.
>>>>>>>>>>>>
>>>>>>>>>>>> However, the protected pages are not being crawled.
>>>>>>>>>>>>
>>>>>>>>>>>> I have attached nutch-site.xml, httpclient-auth.xml and
>>>>>>>>>>>> hadoop.log in a Zip file for your reference. Kindly check and
>>>>>>>>>>>> let me know what is missing on my end.
>>>>>>>>>>>>
>>>>>>>>>>>> CONFIGURATION:
>>>>>>>>>>>>
>>>>>>>>>>>> nutch-2008-07-10_04-01-48.tar (downloaded from
>>>>>>>>>>>> http://hudson.zones.apache.org/hudson/job/Nutch-trunk/, which
>>>>>>>>>>>> contains your patch for HttpAuthentication)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Windows XP
>>>>>>>>>>>>
>>>>>>>>>>>> Cygwin
>>>>>>>>>>>>
>>>>>>>>>>>> jdk1.6.0
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks in advance…
>>>>>>>>>>>>
>>>>>>>>>>>> Please help....
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Biswajit
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> http://www.nabble.com/file/p19492846/Nutch.zip Nutch.zip
>>>>>>>>>> --
>>>>>>>>>> View this message in context:
>>>>>>>>>> http://www.nabble.com/Not-able-to-crawl-password-protected-pages-using-NUTCH-0.9-tp19492246p19492846.html
>>>>>>>>>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> http://www.nabble.com/file/p19507146/hadoop.log hadoop.log
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>  http://www.nabble.com/file/p19510820/debugenabled_hadoop.log
>>>>>> debugenabled_hadoop.log
>>>>>>
>>>>> http://www.nabble.com/file/p19514374/latest.log latest.log
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>> 
>> 
>  http://www.nabble.com/file/p19552519/new.txt new.txt 
> 
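The authscope point quoted above (credentials set for 10.222.18.113 are not applied when the URL says localhost) can be illustrated with a small sketch. This is not Nutch's code: the function name, the dictionary layout, and the sample user name and password are assumptions made only for illustration, and the host comparison is a literal string match, as Susam describes.

```python
from urllib.parse import urlsplit

# Hypothetical credential scopes, loosely modelled on the idea of an
# authscope: each entry ties a user name and password to one host and port.
SCOPES = [
    {"host": "10.222.18.113", "port": 8080,
     "username": "demo-user", "password": "demo-pass"},
]

def credentials_for(url, scopes=SCOPES):
    """Return (username, password) for the first scope whose host and port
    literally match the URL, or None when no scope matches."""
    parts = urlsplit(url)
    host = parts.hostname
    # Fall back to the scheme's default port when the URL names none.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    for scope in scopes:
        if scope["host"] == host and scope["port"] == port:
            return scope["username"], scope["password"]
    return None

# The IP form matches the configured scope...
print(credentials_for("http://10.222.18.113:8080/dao/"))
# ...but "localhost" is a different string, so no credentials are selected.
print(credentials_for("http://localhost:8080/"))
```

Under these assumptions, the seed URL and the configured scope have to use the same host string for any credentials to be picked up at all.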



Re: Not able to crawl password protected pages using NUTCH 0.9

Posted by biswajit_rout <bi...@lntinfotech.com>.
Hi,

This is regarding the indexing that NUTCH does...

From the log below we can see that it has indexed 75 docs. My question
is: what is the maximum number of docs that can be indexed? Can we limit
this value?

2008-09-24 13:13:23,390 INFO  indexer.Indexer - Optimizing index.
2008-09-24 13:13:23,406 INFO  indexer.Indexer - merging segments _ram_1e (1
docs) _ram_1f (1 docs) _ram_1g (1 docs) _ram_1h (1 docs) _ram_1i (1 docs)
_ram_1j (1 docs) _ram_1k (1 docs) _ram_1l (1 docs) _ram_1m (1 docs) _ram_1n
(1 docs) _ram_1o (1 docs) _ram_1p (1 docs) _ram_1q (1 docs) _ram_1r (1 docs)
_ram_1s (1 docs) _ram_1t (1 docs) _ram_1u (1 docs) _ram_1v (1 docs) _ram_1w
(1 docs) _ram_1x (1 docs) _ram_1y (1 docs) _ram_1z (1 docs) _ram_20 (1 docs)
_ram_21 (1 docs) _ram_22 (1 docs) into _1 (25 docs)
2008-09-24 13:13:23,437 INFO  indexer.Indexer - merging segments _0 (50
docs) _1 (25 docs) into _2 (75 docs)
2008-09-24 13:13:24,216 INFO  indexer.Indexer - Indexer: done
2008-09-24 13:13:24,216 INFO  indexer.DeleteDuplicates - Dedup: starting
2008-09-24 13:13:24,232 INFO  indexer.DeleteDuplicates - Dedup: adding
indexes in: crawl/indexes
2008-09-24 13:13:27,723 INFO  indexer.DeleteDuplicates - Dedup: done
2008-09-24 13:13:27,723 INFO  indexer.IndexMerger - merging indexes to:
crawl/index
2008-09-24 13:13:27,738 INFO  indexer.IndexMerger - Adding
crawl/indexes/part-00000
2008-09-24 13:13:27,816 INFO  indexer.IndexMerger - done merging
2008-09-24 13:13:27,832 INFO  crawl.Crawl - crawl finished: crawl
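The "merging segments" records in the log above can be read as simple arithmetic: the segment sizes on the left of "into" add up to the size of the merged segment on the right (50 + 25 = 75). A minimal sketch of that reading follows. It assumes each log record has been joined back onto a single line (the mail wrapped them) and that the last "(N docs)" group is the merge result; the function name is hypothetical and is not part of Nutch.

```python
import re

# Matches every "(N docs)" group in a log record.
DOCS = re.compile(r"\((\d+)\s+docs\)")

def merge_counts(record):
    """Split a 'merging segments' record into (input sizes, output size)."""
    counts = [int(n) for n in DOCS.findall(record)]
    if len(counts) < 2:
        raise ValueError("not a 'merging segments' record")
    return counts[:-1], counts[-1]

record = ("2008-09-24 13:13:23,437 INFO  indexer.Indexer - merging segments "
          "_0 (50 docs) _1 (25 docs) into _2 (75 docs)")
inputs, output = merge_counts(record)
# Prints the input sizes, the merged size, and whether they agree.
print(inputs, output, sum(inputs) == output)
```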

Best regards,
Biswajit.


Susam Pal wrote:
> 
> Replies inline.
> 
> On Mon, Sep 22, 2008 at 1:40 PM, biswajit_rout
> <bi...@lntinfotech.com> wrote:
>>
>> Hi Susam,
>>
>> I saw,
>> http://www.nabble.com/Problems-testing-Authentication-td13991771.html#a13995888
>>
>> Where your patch is used for web server (Tomcat) manager authentication.
> 
> My patch is used for Basic, Digest or NTLM authentication schemes
> only. It works with Tomcat probably because it uses Basic
> authentication to login to the manager. I hope you have read this:
> http://wiki.apache.org/nutch/HttpAuthenticationSchemes.
> 
>>
>> However my requirement is different…
>>
>> I am trying to crawl sites which are password protected, just like gmail.
>> That means I have to supply the correct user name and password, after
>> which I will be able to see all the pages / modules. Is there a different
>> configuration mechanism for this?
> 
> 
> POST-based authentication is not implemented. You might want to read this:
> http://wiki.apache.org/nutch/HttpPostAuthentication
> 
> Regards,
> Susam Pal
> 
>>
>> Could you please let me know???
>>
>> Best regards,
>> Biswajit.
>>
>>
>>
>> Susam Pal wrote:
>>>
>>> Hi Biswajit,
>>>
>>> I don't find a single error caused by an authentication problem in
>>> the 'new.txt' file you attached to an earlier mail. Most of the
>>> entries are HTTP 404 or HTTP 302 responses, which mean either that the
>>> page is not available or that the page has moved to another location,
>>> which the crawler would then try to fetch. There's nothing I can do to
>>> help you in this matter. You have access to the network and you can
>>> analyze better why this is happening. Please do not send the same mail
>>> multiple times. As I have told you before, it takes time for members
>>> to respond, as they do so only in their free time.
>>>
>>> Regards,
>>> Susam Pal
>>>
>>> On Fri, Sep 19, 2008 at 5:38 AM, biswajit_rout
>>> <bi...@lntinfotech.com> wrote:
>>>>
>>>> Hi Susam,
>>>>
>>>> Please take a look at the attached file (new.txt) and suggest a
>>>> solution. This time I have crawled another site. I am able to crawl
>>>> all the public pages, but the password-protected pages are not being
>>>> crawled.
>>>>
>>>> Best regards,
>>>> Biswajit.
>>>>
>>>>
>>>> biswajit_rout wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> There is nothing to crawl in the home page of
>>>>> http://10.222.18.113:8080/dao/.
>>>>>
>>>>> So this time i have crawled another site. I have successfully crawled
>>>>> all
>>>>> the public pages but not able to crawl private pages.
>>>>> I have attached a log file(new.log). Can you please check and let me
>>>>> know
>>>>> what needs to be done from my end???
>>>>>
>>>>> Best regards,
>>>>> Biswajit.
>>>>>
>>>>>
>>>>> Susam Pal wrote:
>>>>>>
>>>>>> The log file shows only one fetching line:
>>>>>>
>>>>>> 2008-09-16 20:46:15,321 INFO  fetcher.Fetcher - fetching
>>>>>> http://10.222.18.113:8080/dao/
>>>>>>
>>>>>> This has been fetched successfully. There is no other page being
>>>>>> fetched. Have you set up Nutch properly so that it can fetch all the
>>>>>> pages you need? If it tries to fetch a page but fails due to
>>>>>> authentication, then it is a problem with authentication.
>>>>>>
>>>>>> In this case, it is not even attempting to fetch those pages. So, the
>>>>>> problem lies elsewhere. You need to first find out why it is fetching
>>>>>> only one page and not others.
>>>>>>
>>>>>> Regards,
>>>>>> Susam Pal
>>>>>>
>>>>>> On Tue, Sep 16, 2008 at 5:24 PM, biswajit_rout
>>>>>> <bi...@lntinfotech.com> wrote:
>>>>>>>
>>>>>>> But still it is not crawling the password protected pages...
>>>>>>>
>>>>>>> Regards,
>>>>>>> Biswajit.
>>>>>>>
>>>>>>>
>>>>>>> Susam Pal wrote:
>>>>>>>>
>>>>>>>> The latest log shows that the page from the URL:
>>>>>>>> http://10.222.18.113:8080/dao/ has been fetched successfully.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Susam Pal
>>>>>>>>
>>>>>>>> On Tue, Sep 16, 2008 at 3:33 PM, biswajit_rout
>>>>>>>> <bi...@lntinfotech.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Susam,
>>>>>>>>>
>>>>>>>>> Please find the latest log file(latest.log), which shows different
>>>>>>>>> error.
>>>>>>>>>
>>>>>>>>> 2008-09-16 20:46:16,102 DEBUG httpclient.Http - url:
>>>>>>>>> http://10.222.18.113:8080/robots.txt; status code: 404; bytes
>>>>>>>>> received:
>>>>>>>>> 985;
>>>>>>>>> Content-Length: 985
>>>>>>>>> 2008-09-16 20:46:16,384 DEBUG httpclient.Http - url:
>>>>>>>>> http://10.222.18.113:8080/dao/; status code: 200; bytes received:
>>>>>>>>> 1941;
>>>>>>>>> Content-Length: 1941
>>>>>>>>>
>>>>>>>>> Thanks in advance...
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Biswajit.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> biswajit_rout wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Susam,
>>>>>>>>>>
>>>>>>>>>> Thanks for your immediate response...
>>>>>>>>>> Herewith i am attaching the debug enabled log
>>>>>>>>>> file(debugenabled_hadoop.log). Kindly go through the file and let
>>>>>>>>>> me
>>>>>>>>>> know
>>>>>>>>>> what is missing from my end...
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>> Biswajit.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Susam Pal wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Biswajit,
>>>>>>>>>>>
>>>>>>>>>>> The authscope specifies which IP address or domain name the
>>>>>>>>>>> credentials are used for. If you provide 10.222.18.113 in the
>>>>>>>>>>> authscope, the credentials will not be used for localhost, even
>>>>>>>>>>> though both refer to the same machine.
>>>>>>>>>>>
>>>>>>>>>>> Please provide logs with DEBUG enabled.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Susam Pal
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 16, 2008 at 1:33 PM, biswajit_rout
>>>>>>>>>>> <bi...@lntinfotech.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Susam,
>>>>>>>>>>>>
>>>>>>>>>>>> The ip 10.222.18.113 is nothing but the ip address of my
>>>>>>>>>>>> machine(localhost).
>>>>>>>>>>>> Now also i changed http://localhost:8080/ to
>>>>>>>>>>>> http://10.222.18.113:8080.
>>>>>>>>>>>> However no result, i mean to say still not able to crawl
>>>>>>>>>>>> password
>>>>>>>>>>>> protected
>>>>>>>>>>>> pages.
>>>>>>>>>>>>
>>>>>>>>>>>> Kindly assist me to resolve this issue.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks in advance...
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Biswajit.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Susam Pal wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> The logs show that it is fetching http://localhost:8080/ but
>>>>>>>>>>>>> you
>>>>>>>>>>>>> have
>>>>>>>>>>>>> set credentials for 10.222.18.113:8080 which is never being
>>>>>>>>>>>>> fetched.
>>>>>>>>>>>>> So, no authentication takes place.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Susam Pal
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Sep 15, 2008 at 1:20 PM, biswajit_rout
>>>>>>>>>>>>> <bi...@lntinfotech.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Susam,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In order to crawl password protected pages, I am using
>>>>>>>>>>>>>> nutch-2008-07-10_04-01-48.tar (I have download from
>>>>>>>>>>>>>> http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ which
>>>>>>>>>>>>>> contains
>>>>>>>>>>>>>> your
>>>>>>>>>>>>>> patch for HttpAuthentication)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have modified nutch-site.xml, httpclient-auth.xml.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please find the attached zip file which contains
>>>>>>>>>>>>>> nutch-site.xml,httpclient-auth.xml.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Kindly provide me a solution for this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>> Biswajit
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Susam Pal wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Biswajit,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Could you please tell us how you have added the support for
>>>>>>>>>>>>>>> authentication in Nutch 0.9? Nutch 0.9 cannot do
>>>>>>>>>>>>>>> authentication properly by default. The authentication
>>>>>>>>>>>>>>> feature is buggy in Nutch 0.9; this was fixed with this
>>>>>>>>>>>>>>> ticket: https://issues.apache.org/jira/browse/NUTCH-559
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The feature is documented here:
>>>>>>>>>>>>>>> http://wiki.apache.org/nutch/HttpAuthenticationSchemes
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The easiest way to use it is to check out the latest version
>>>>>>>>>>>>>>> of Nutch and build it, as it contains the authentication
>>>>>>>>>>>>>>> feature. If you want to use it with Nutch 0.9, you have to
>>>>>>>>>>>>>>> download the latest patch present in the ticket page, apply
>>>>>>>>>>>>>>> it to the source code and build it. You might have to
>>>>>>>>>>>>>>> resolve some conflicts manually.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would suggest that you do not send the same mail multiple
>>>>>>>>>>>>>>> times. We have received the same mail from you 4 times. It
>>>>>>>>>>>>>>> takes some time for members to reply to a mail. :-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Susam Pal
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Sep 15, 2008 at 6:07 PM, Rout Biswajit-B16078
>>>>>>>>>>>>>>> <B1...@freescale.com> wrote:
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have successfully configured NUTCH 0.9, which is crawling
>>>>>>>>>>>>>>>> number
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>> sites
>>>>>>>>>>>>>>>> and after that searching is also happening properly.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> However, now I want to crawl password protected pages using
>>>>>>>>>>>>>>>> NUTCH.
>>>>>>>>>>>>>>>> In
>>>>>>>>>>>>>>>> order
>>>>>>>>>>>>>>>> to access those pages I should have a valid user name and
>>>>>>>>>>>>>>>> password.
>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>> configured the user name and password in my nutch-site.xml
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> httpclient-auth.xml
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> However it is not crawling.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have attached nutch-site.xml, httpclient-auth.xml and
>>>>>>>>>>>>>>>> hadoop.log
>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> Zip file for your reference. Kindly check and let me know
>>>>>>>>>>>>>>>> what
>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>> missing
>>>>>>>>>>>>>>>> from my end.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> CONFIGURATION:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> nutch-2008-07-10_04-01-48.tar (I have download from
>>>>>>>>>>>>>>>> http://hudson.zones.apache.org/hudson/job/Nutch-trunk/
>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>> contains
>>>>>>>>>>>>>>>> your
>>>>>>>>>>>>>>>> patch for HttpAuthentication)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Windows XP
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cygwin
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> jdk1.6.0
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks in advance…
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please help....
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Biswajit
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://www.nabble.com/file/p19492846/Nutch.zip Nutch.zip
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> http://www.nabble.com/file/p19507146/hadoop.log hadoop.log
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>  http://www.nabble.com/file/p19510820/debugenabled_hadoop.log
>>>>>>>>>> debugenabled_hadoop.log
>>>>>>>>>>
>>>>>>>>> http://www.nabble.com/file/p19514374/latest.log latest.log
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>  http://www.nabble.com/file/p19552519/new.txt new.txt
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
> 
> 



Re: Not able to crawl password protected pages using NUTCH 0.9

Posted by Susam Pal <su...@gmail.com>.
Replies inline.

On Mon, Sep 22, 2008 at 1:40 PM, biswajit_rout
<bi...@lntinfotech.com> wrote:
>
> Hi Susam,
>
> I saw,
> http://www.nabble.com/Problems-testing-Authentication-td13991771.html#a13995888
>
> Where your patch is used for web server (Tomcat) manager authentication.

My patch is used for Basic, Digest or NTLM authentication schemes
only. It works with Tomcat probably because it uses Basic
authentication to login to the manager. I hope you have read this:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes.
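For reference, the Basic scheme mentioned above works at the HTTP level: the server answers a request with status 401 and a WWW-Authenticate header, and the client retries with an Authorization header that carries the base64 encoding of "user:password". The sketch below shows only those two pieces. The function names are hypothetical and this is not Nutch's code; Digest and NTLM use different exchanges that are not shown here.

```python
import base64

def basic_auth_header(username, password):
    """Build the value of an Authorization header for the Basic scheme."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")

def is_http_auth_challenge(status, headers):
    """True when a response asks for HTTP-level authentication:
    status 401 together with a WWW-Authenticate header."""
    names = {name.lower() for name in headers}
    return status == 401 and "www-authenticate" in names

print(basic_auth_header("Aladdin", "open sesame"))
print(is_http_auth_challenge(401, {"WWW-Authenticate": 'Basic realm="manager"'}))
print(is_http_auth_challenge(200, {"Content-Type": "text/html"}))
```

A page that returns 200 with an HTML login form never issues such a challenge, which is why a crawler configured only for these schemes would not log in to it.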

>
> However my requirement is different…
>
> I am trying to crawl sites which are password protected, just like gmail.
> That means I have to supply the correct user name and password, after
> which I will be able to see all the pages / modules. Is there a different
> configuration mechanism for this?


POST-based authentication is not implemented. You might want to read this:
http://wiki.apache.org/nutch/HttpPostAuthentication
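By contrast, a form-based login of the kind described in the question sends the credentials in the body of a POST request to a login page, and the site then typically tracks the session with a cookie. The sketch below only builds such a request with the Python standard library; it does not send it. The login URL and the form field names are hypothetical, since every site defines its own.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_login_request(login_url, username, password):
    """Build (without sending) a form-style login request.
    The field names 'username' and 'password' are assumptions."""
    body = urlencode({"username": username, "password": password}).encode("ascii")
    return Request(
        login_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )

req = build_login_request("http://example.com/login", "demo-user", "demo-pass")
# The credentials travel in the request body, not in an Authorization header.
print(req.get_method(), req.data)
```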

Regards,
Susam Pal

>
> Could you please let me know???
>
> Best regards,
> Biswajit.
>
>
>
> Susam Pal wrote:
>>
>> Hi Biswajit,
>>
>> I don't find a single error caused by an authentication problem in
>> the 'new.txt' file you attached to an earlier mail. Most of the
>> entries are HTTP 404 or HTTP 302 responses, which mean either that the
>> page is not available or that the page has moved to another location,
>> which the crawler would then try to fetch. There's nothing I can do to
>> help you in this matter. You have access to the network and you can
>> analyze better why this is happening. Please do not send the same mail
>> multiple times. As I have told you before, it takes time for members
>> to respond, as they do so only in their free time.
>>
>> Regards,
>> Susam Pal
>>
>> On Fri, Sep 19, 2008 at 5:38 AM, biswajit_rout
>> <bi...@lntinfotech.com> wrote:
>>>
>>> Hi Susam,
>>>
>>> Please take a look at the attached file (new.txt) and suggest a
>>> solution. This time I have crawled another site. I am able to crawl
>>> all the public pages, but the password-protected pages are not being
>>> crawled.
>>>
>>> Best regards,
>>> Biswajit.
>>>
>>>
>>> biswajit_rout wrote:
>>>>
>>>> Hi,
>>>>
>>>> There is nothing to crawl in the home page of
>>>> http://10.222.18.113:8080/dao/.
>>>>
>>>> So this time i have crawled another site. I have successfully crawled
>>>> all
>>>> the public pages but not able to crawl private pages.
>>>> I have attached a log file(new.log). Can you please check and let me
>>>> know
>>>> what needs to be done from my end???
>>>>
>>>> Best regards,
>>>> Biswajit.
>>>>
>>>>
>>>> Susam Pal wrote:
>>>>>
>>>>> The log file shows only one fetching line:
>>>>>
>>>>> 2008-09-16 20:46:15,321 INFO  fetcher.Fetcher - fetching
>>>>> http://10.222.18.113:8080/dao/
>>>>>
>>>>> This has been fetched successfully. There is no other page being
>>>>> fetched. Have you set up Nutch properly so that it can fetch all the
>>>>> pages you need? If it tries to fetch a page but fails due to
>>>>> authentication, then it is a problem with authentication.
>>>>>
>>>>> In this case, it is not even attempting to fetch those pages. So, the
>>>>> problem lies elsewhere. You need to first find out why it is fetching
>>>>> only one page and not others.
>>>>>
>>>>> Regards,
>>>>> Susam Pal
>>>>>
>>>>> On Tue, Sep 16, 2008 at 5:24 PM, biswajit_rout
>>>>> <bi...@lntinfotech.com> wrote:
>>>>>>
>>>>>> But still it is not crawling the password protected pages...
>>>>>>
>>>>>> Regards,
>>>>>> Biswajit.
>>>>>>
>>>>>>
>>>>>> Susam Pal wrote:
>>>>>>>
>>>>>>> The latest log shows that the page from the URL:
>>>>>>> http://10.222.18.113:8080/dao/ has been fetched successfully.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Susam Pal
>>>>>>>
>>>>>>> On Tue, Sep 16, 2008 at 3:33 PM, biswajit_rout
>>>>>>> <bi...@lntinfotech.com> wrote:
>>>>>>>>
>>>>>>>> Hi Susam,
>>>>>>>>
>>>>>>>> Please find the latest log file(latest.log), which shows different
>>>>>>>> error.
>>>>>>>>
>>>>>>>> 2008-09-16 20:46:16,102 DEBUG httpclient.Http - url:
>>>>>>>> http://10.222.18.113:8080/robots.txt; status code: 404; bytes
>>>>>>>> received:
>>>>>>>> 985;
>>>>>>>> Content-Length: 985
>>>>>>>> 2008-09-16 20:46:16,384 DEBUG httpclient.Http - url:
>>>>>>>> http://10.222.18.113:8080/dao/; status code: 200; bytes received:
>>>>>>>> 1941;
>>>>>>>> Content-Length: 1941
>>>>>>>>
>>>>>>>> Thanks in advance...
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Biswajit.
>>>>>>>>
>>>>>>>>
>>>>>>>> biswajit_rout wrote:
>>>>>>>>>
>>>>>>>>> Hi Susam,
>>>>>>>>>
>>>>>>>>> Thanks for your immediate response...
>>>>>>>>> Herewith i am attaching the debug enabled log
>>>>>>>>> file(debugenabled_hadoop.log). Kindly go through the file and let
>>>>>>>>> me
>>>>>>>>> know
>>>>>>>>> what is missing from my end...
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Biswajit.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Susam Pal wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Biswajit,
>>>>>>>>>>
>>>>>>>>>> The authscope specifies which IP address or domain-name would the
>>>>>>>>>> credentials be used for. If you provide 10.222.18.113 in the
>>>>>>>>>> authscope, the credentials would not be used for localhost even
>>>>>>>>>> though
>>>>>>>>>> both represent the same machine.
>>>>>>>>>>
>>>>>>>>>> Please provide logs with DEBUG enabled.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Susam Pal
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 16, 2008 at 1:33 PM, biswajit_rout
>>>>>>>>>> <bi...@lntinfotech.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Susam,
>>>>>>>>>>>
>>>>>>>>>>> The ip 10.222.18.113 is nothing but the ip address of my
>>>>>>>>>>> machine(localhost).
>>>>>>>>>>> Now also i changed http://localhost:8080/ to
>>>>>>>>>>> http://10.222.18.113:8080.
>>>>>>>>>>> However no result, i mean to say still not able to crawl password
>>>>>>>>>>> protected
>>>>>>>>>>> pages.
>>>>>>>>>>>
>>>>>>>>>>> Kindly assist me to resolve this issue.
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance...
>>>>>>>>>>>
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Biswajit.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Susam Pal wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> The logs show that it is fetching http://localhost:8080/ but you
>>>>>>>>>>>> have
>>>>>>>>>>>> set credentials for 10.222.18.113:8080 which is never being
>>>>>>>>>>>> fetched.
>>>>>>>>>>>> So, no authentication takes place.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Susam Pal
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Sep 15, 2008 at 1:20 PM, biswajit_rout
>>>>>>>>>>>> <bi...@lntinfotech.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Susam,
>>>>>>>>>>>>>
>>>>>>>>>>>>> In order to crawl password protected pages, I am using
>>>>>>>>>>>>> nutch-2008-07-10_04-01-48.tar (I have download from
>>>>>>>>>>>>> http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ which
>>>>>>>>>>>>> contains
>>>>>>>>>>>>> your
>>>>>>>>>>>>> patch for HttpAuthentication)
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have modified nutch-site.xml, httpclient-auth.xml.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please find the attached zip file which contains
>>>>>>>>>>>>> nutch-site.xml,httpclient-auth.xml.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Kindly provide me a solution for this.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Biswajit
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Susam Pal wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Biswajit,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Could you please tell us how you have added the support for
>>>>>>>>>>>>>> authentication in Nutch 0.9? Nutch 0.9 can not do
>>>>>>>>>>>>>> authentication
>>>>>>>>>>>>>> properly by default. The authentication feature is buggy in
>>>>>>>>>>>>>> Nutch
>>>>>>>>>>>>>> 0.9
>>>>>>>>>>>>>> which was fixed with this ticket:
>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/NUTCH-559
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The feature is documented here:
>>>>>>>>>>>>>> http://wiki.apache.org/nutch/HttpAuthenticationSchemes
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The easiest way to use it is to check out the latest version
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>> Nutch
>>>>>>>>>>>>>> and build it as it contains the authentication feature. If you
>>>>>>>>>>>>>> want
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>> use it with Nutch 0.9, you have to download the latest patch
>>>>>>>>>>>>>> present
>>>>>>>>>>>>>> in the ticket page and apply it to the source code and build
>>>>>>>>>>>>>> it.
>>>>>>>>>>>>>> You
>>>>>>>>>>>>>> might have to resolve some conflicts manually.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would suggest that you do not send the mail same mail
>>>>>>>>>>>>>> multiple
>>>>>>>>>>>>>> times. We have received the same mail from you 4 times. It
>>>>>>>>>>>>>> takes
>>>>>>>>>>>>>> sometime for members to reply to a mail. :-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Susam Pal
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Sep 15, 2008 at 6:07 PM, Rout Biswajit-B16078
>>>>>>>>>>>>>> <B1...@freescale.com> wrote:
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have successfully configured NUTCH 0.9, which is crawling
>>>>>>>>>>>>>>> number
>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>> sites
>>>>>>>>>>>>>>> and after that searching is also happening properly.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However, now I want to crawl password protected pages using
>>>>>>>>>>>>>>> NUTCH.
>>>>>>>>>>>>>>> In
>>>>>>>>>>>>>>> order
>>>>>>>>>>>>>>> to access those pages I should have a valid user name and
>>>>>>>>>>>>>>> password.
>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>> configured the user name and password in my nutch-site.xml
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> httpclient-auth.xml
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However it is not crawling.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have attached nutch-site.xml, httpclient-auth.xml and
>>>>>>>>>>>>>>> hadoop.log
>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> Zip file for your reference. Kindly check and let me know
>>>>>>>>>>>>>>> what
>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>> missing
>>>>>>>>>>>>>>> from my end.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> CONFIGURATION:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> nutch-2008-07-10_04-01-48.tar (I have download from
>>>>>>>>>>>>>>> http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ which
>>>>>>>>>>>>>>> contains
>>>>>>>>>>>>>>> your
>>>>>>>>>>>>>>> patch for HttpAuthentication)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Windows XP
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cygwin
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> jdk1.6.0
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks in advance…
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please help....
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Biswajit
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>> http://www.nabble.com/file/p19492846/Nutch.zip Nutch.zip
>>>>>>>>>>>>> --
>>>>>>>>>>>>> View this message in context:
>>>>>>>>>>>>> http://www.nabble.com/Not-able-to-crawl-password-protected-pages-using-NUTCH-0.9-tp19492246p19492846.html
>>>>>>>>>>>>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> http://www.nabble.com/file/p19507146/hadoop.log hadoop.log
>>>>>>>>>>> --
>>>>>>>>>>> View this message in context:
>>>>>>>>>>> http://www.nabble.com/Not-able-to-crawl-password-protected-pages-using-NUTCH-0.9-tp19492246p19507146.html
>>>>>>>>>>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>  http://www.nabble.com/file/p19510820/debugenabled_hadoop.log
>>>>>>>>> debugenabled_hadoop.log
>>>>>>>>>
>>>>>>>> http://www.nabble.com/file/p19514374/latest.log latest.log
>>>>>>>> --
>>>>>>>> View this message in context:
>>>>>>>> http://www.nabble.com/Not-able-to-crawl-password-protected-pages-using-NUTCH-0.9-tp19492246p19514374.html
>>>>>>>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://www.nabble.com/Not-able-to-crawl-password-protected-pages-using-NUTCH-0.9-tp19492246p19516409.html
>>>>>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>  http://www.nabble.com/file/p19552519/new.txt new.txt
>>>>
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Not-able-to-crawl-password-protected-pages-using-NUTCH-0.9-tp19492246p19566502.html
>>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>
> --
> View this message in context: http://www.nabble.com/Not-able-to-crawl-password-protected-pages-using-NUTCH-0.9-tp19492246p19603477.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>

Re: Not able to crawl password protected pages using NUTCH 0.9

Posted by biswajit_rout <bi...@lntinfotech.com>.
Hi Susam,

I saw this thread:
http://www.nabble.com/Problems-testing-Authentication-td13991771.html#a13995888

There, your patch is used for web server (Tomcat) manager authentication.

However, my requirement is different.

I am trying to crawl sites that are password protected in the same way as Gmail.
That is, I have to submit a correct user name and password, and only after
that can I see all the pages / modules. Is there a different configuration
mechanism for this case?

Could you please let me know?

Best regards,
Biswajit.



-- 
View this message in context: http://www.nabble.com/Not-able-to-crawl-password-protected-pages-using-NUTCH-0.9-tp19492246p19603477.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Not able to crawl password protected pages using NUTCH 0.9

Posted by Susam Pal <su...@gmail.com>.
Hi Biswajit,

I don't find a single error caused by an authentication problem in
the 'new.txt' file you attached to an earlier mail. Most of the
errors are HTTP 404 or HTTP 302 responses, which mean either that the
page is not available or that the page has been moved to another
location, which the crawler would then try to fetch. There's nothing I
can do to help you in this matter. You have access to the network, so
you are in a better position to analyze why this is happening. Please
do not send the same mail multiple times. As I have told you before, it
takes time for members to respond, as they do so only in their free time.

Regards,
Susam Pal
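
The check described in the message above (whether any fetch failed because of
authentication, as opposed to 404 or 302 responses) can be sketched as a small
log scan. This is only a minimal sketch and not part of Nutch. It assumes DEBUG
lines of the form quoted elsewhere in this thread ("url: ...; status code: NNN;
..."), with each entry on a single line. The function name count_status_codes
is hypothetical. An authentication failure would normally show up as HTTP 401
(or 407 for a proxy).

```python
import re
from collections import Counter

# Matches the DEBUG line format quoted in this thread, for example:
#   ... DEBUG httpclient.Http - url: http://host/path; status code: 404; ...
# Assumption: each log entry sits on one line in the real hadoop.log.
STATUS_LINE = re.compile(
    r"url:\s*(?P<url>\S+?);\s*status code:\s*(?P<code>\d{3})"
)

# Status codes that indicate an authentication challenge or failure.
AUTH_CODES = {401, 407}


def count_status_codes(lines):
    """Return (Counter of status codes, list of URLs answered with 401/407)."""
    counts = Counter()
    auth_failures = []
    for line in lines:
        match = STATUS_LINE.search(line)
        if match is None:
            continue  # not a fetch-status line
        code = int(match.group("code"))
        counts[code] += 1
        if code in AUTH_CODES:
            auth_failures.append(match.group("url"))
    return counts, auth_failures


if __name__ == "__main__":
    # The two status lines quoted in this thread, joined onto single lines.
    sample = [
        "2008-09-16 20:46:16,102 DEBUG httpclient.Http - url: "
        "http://10.222.18.113:8080/robots.txt; status code: 404; "
        "bytes received: 985; Content-Length: 985",
        "2008-09-16 20:46:16,384 DEBUG httpclient.Http - url: "
        "http://10.222.18.113:8080/dao/; status code: 200; "
        "bytes received: 1941; Content-Length: 1941",
    ]
    counts, auth_failures = count_status_codes(sample)
    print("status codes:", dict(counts))
    print("URLs with 401/407:", auth_failures)
```

If the second list comes back empty, the crawler never received an HTTP
authentication challenge, and the cause of the missing pages lies elsewhere.
Note that a site using a login form rather than HTTP authentication would
typically not answer with 401 at all, so an empty list does not by itself show
that the pages are public.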


Re: Not able to crawl password protected pages using NUTCH 0.9

Posted by biswajit_rout <bi...@lntinfotech.com>.
Hi Susam,

Please take a look at the attached file (new.txt) and suggest a solution
for this. This time I have crawled another site. I am able to crawl all the
public pages, but the password-protected pages are not being crawled...

Best regards,
Biswajit.


biswajit_rout wrote:
> 
> Hi,
> 
> There is nothing to crawl in the home page of
> http://10.222.18.113:8080/dao/.
> 
> So this time i have crawled another site. I have successfully crawled all
> the public pages but not able to crawl private pages.
> I have attached a log file(new.log). Can you please check and let me know
> what needs to be done from my end???
> 
> Best regards,
> Biswajit.
> 
> 
>  http://www.nabble.com/file/p19552519/new.txt new.txt 
> 

-- 
View this message in context: http://www.nabble.com/Not-able-to-crawl-password-protected-pages-using-NUTCH-0.9-tp19492246p19566502.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Not able to crawl password protected pages using NUTCH 0.9

Posted by biswajit_rout <bi...@lntinfotech.com>.
Hi,

There is nothing to crawl on the home page of
http://10.222.18.113:8080/dao/.

So this time I crawled another site. I successfully crawled all the public
pages, but I am not able to crawl the private pages.
I have attached a log file (new.txt). Can you please check and let me know
what needs to be done on my end?

Best regards,
Biswajit.


Susam Pal wrote:
> 
> The log file shows only one fetching line:
> 
> 2008-09-16 20:46:15,321 INFO  fetcher.Fetcher - fetching http://10.222.18.113:8080/dao/
> 
> This has been fetched successfully. There is no other page being
> fetched. Have you set up Nutch properly so that it can fetch all the
> pages you need? If it tries to fetch a page but fails due to
> authentication, then it is a problem with authentication.
> 
> In this case, it is not even attempting to fetch those pages. So, the
> problem lies elsewhere. You need to first find out why it is fetching
> only one page and not others.
> 
> Regards,
> Susam Pal
> 
http://www.nabble.com/file/p19552519/new.txt new.txt 
-- 
View this message in context: http://www.nabble.com/Not-able-to-crawl-password-protected-pages-using-NUTCH-0.9-tp19492246p19552519.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Not able to crawl password protected pages using NUTCH 0.9

Posted by Susam Pal <su...@gmail.com>.
The log file shows only one fetching line:

2008-09-16 20:46:15,321 INFO  fetcher.Fetcher - fetching http://10.222.18.113:8080/dao/

This has been fetched successfully. There is no other page being
fetched. Have you set up Nutch properly so that it can fetch all the
pages you need? If it tries to fetch a page but fails due to
authentication, then it is a problem with authentication.

In this case, it is not even attempting to fetch those pages. So, the
problem lies elsewhere. You need to first find out why it is fetching
only one page and not others.
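
As a rough illustration of that first check, the sketch below scans a Nutch
log for the fetcher's "fetching" lines and for the httpclient DEBUG status
lines, using the two formats quoted in this thread. The function name, the
regular expressions, and the sample text are assumptions made for
illustration; they are not part of Nutch. A 401 status would mean a page was
requested and the server answered with an authentication challenge. If the
protected URLs never appear at all, the crawl setup (for example URL filters
or crawl depth) is the more likely place to look.

```python
import re

# Patterns modelled on the two log formats quoted in this thread
# (an assumption for illustration, not taken from the Nutch source).
FETCH_RE = re.compile(r"fetcher\.Fetcher - fetching\s+(\S+)")
STATUS_RE = re.compile(
    r"httpclient\.Http - url:\s*(\S+?);\s*status code:\s*(\d{3})"
)


def summarize_log(text):
    """Return (fetched_urls, status_by_url) found in a hadoop.log dump."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(text.split())
    fetched = FETCH_RE.findall(flat)
    statuses = {url: int(code) for url, code in STATUS_RE.findall(flat)}
    return fetched, statuses


if __name__ == "__main__":
    # Sample lines copied from the log excerpts posted in this thread.
    sample = """
    2008-09-16 20:46:15,321 INFO  fetcher.Fetcher - fetching
    http://10.222.18.113:8080/dao/
    2008-09-16 20:46:16,102 DEBUG httpclient.Http - url:
    http://10.222.18.113:8080/robots.txt; status code: 404; bytes received: 985;
    Content-Length: 985
    2008-09-16 20:46:16,384 DEBUG httpclient.Http - url:
    http://10.222.18.113:8080/dao/; status code: 200; bytes received: 1941;
    Content-Length: 1941
    """
    fetched, statuses = summarize_log(sample)
    print("fetch attempts:", fetched)
    for url, code in statuses.items():
        note = "authentication challenge" if code == 401 else ""
        print(code, url, note)
```

With the excerpts above, the only fetch attempt is the /dao/ page, and neither
status line is a 401, which matches the observation that no protected page was
ever requested.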

Regards,
Susam Pal

On Tue, Sep 16, 2008 at 5:24 PM, biswajit_rout
<bi...@lntinfotech.com> wrote:
>
> But it is still not crawling the password-protected pages.
>
> Regards,
> Biswajit.
>
>

Re: Not able to crawl password protected pages using NUTCH 0.9

Posted by biswajit_rout <bi...@lntinfotech.com>.
But it is still not crawling the password-protected pages.

Regards,
Biswajit.


Susam Pal wrote:
> 
> The latest log shows that the page from the URL:
> http://10.222.18.113:8080/dao/ has been fetched successfully.
> 
> Regards,
> Susam Pal
> 

-- 
View this message in context: http://www.nabble.com/Not-able-to-crawl-password-protected-pages-using-NUTCH-0.9-tp19492246p19516409.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Not able to crawl password protected pages using NUTCH 0.9

Posted by Susam Pal <su...@gmail.com>.
The latest log shows that the page from the URL:
http://10.222.18.113:8080/dao/ has been fetched successfully.

Regards,
Susam Pal

On Tue, Sep 16, 2008 at 3:33 PM, biswajit_rout
<bi...@lntinfotech.com> wrote:
>
> Hi Susam,
>
> Please find the latest log file (latest.log), which shows a different error.
>
> 2008-09-16 20:46:16,102 DEBUG httpclient.Http - url: http://10.222.18.113:8080/robots.txt; status code: 404; bytes received: 985; Content-Length: 985
> 2008-09-16 20:46:16,384 DEBUG httpclient.Http - url: http://10.222.18.113:8080/dao/; status code: 200; bytes received: 1941; Content-Length: 1941
>
> Thanks in advance...
>
> Best regards,
> Biswajit.
>
>

Re: Not able to crawl password protected pages using NUTCH 0.9

Posted by biswajit_rout <bi...@lntinfotech.com>.
Hi Susam,

Please find the latest log file (latest.log), which shows a different error.

2008-09-16 20:46:16,102 DEBUG httpclient.Http - url: http://10.222.18.113:8080/robots.txt; status code: 404; bytes received: 985; Content-Length: 985
2008-09-16 20:46:16,384 DEBUG httpclient.Http - url: http://10.222.18.113:8080/dao/; status code: 200; bytes received: 1941; Content-Length: 1941

Thanks in advance...

Best regards,
Biswajit.


biswajit_rout wrote:
> 
> Hi Susam,
> 
> Thanks for your immediate response.
> I am attaching the debug-enabled log file (debugenabled_hadoop.log).
> Kindly go through the file and let me know what is missing on my end.
> 
> Best regards,
> Biswajit.
> 
> 
http://www.nabble.com/file/p19514374/latest.log latest.log 
-- 
View this message in context: http://www.nabble.com/Not-able-to-crawl-password-protected-pages-using-NUTCH-0.9-tp19492246p19514374.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Not able to crawl password protected pages using NUTCH 0.9

Posted by biswajit_rout <bi...@lntinfotech.com>.
Hi Susam,

Thanks for your immediate response.
I am attaching the debug-enabled log file (debugenabled_hadoop.log).
Kindly go through the file and let me know what is missing on my end.

Best regards,
Biswajit.


Susam Pal wrote:
> 
> Hi Biswajit,
> 
> The authscope specifies which IP address or domain name the
> credentials are used for. If you provide 10.222.18.113 in the
> authscope, the credentials are not used for localhost, even though
> both refer to the same machine.
> 
> Please provide logs with DEBUG enabled.
> 
> Regards,
> Susam Pal
> 
http://www.nabble.com/file/p19510820/debugenabled_hadoop.log
debugenabled_hadoop.log 
-- 
View this message in context: http://www.nabble.com/Not-able-to-crawl-password-protected-pages-using-NUTCH-0.9-tp19492246p19510820.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Not able to crawl password protected pages using NUTCH 0.9

Posted by Susam Pal <su...@gmail.com>.
Hi Biswajit,

The authscope specifies which IP address or domain name the
credentials are used for. If you provide 10.222.18.113 in the
authscope, the credentials are not used for localhost, even though
both refer to the same machine.
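
As a rough sketch of what that means for httpclient-auth.xml (the user
name and password are placeholders, and the element layout is assumed
from the HttpAuthenticationSchemes wiki page, so check it against your
build): each authscope is matched against the host exactly as it appears
in the URL being fetched, so a server reachable under two names needs
two entries.

```xml
<?xml version="1.0"?>
<!-- Sketch only: "crawluser" and "secret" are placeholder credentials. -->
<auth-configuration>
  <credentials username="crawluser" password="secret">
    <!-- Applies when the fetched URL names the host as 10.222.18.113 -->
    <authscope host="10.222.18.113" port="8080"/>
    <!-- A separate scope is needed if the same server is fetched as localhost -->
    <authscope host="localhost" port="8080"/>
  </credentials>
</auth-configuration>
```

Listing both names is only one option; using a single host name
consistently in the seed URLs and in the authscope works just as well.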

Please provide logs with DEBUG enabled.

Regards,
Susam Pal

On Tue, Sep 16, 2008 at 1:33 PM, biswajit_rout
<bi...@lntinfotech.com> wrote:
>
> Hi Susam,
>
> 10.222.18.113 is simply the IP address of my own machine (localhost).
> I have now changed http://localhost:8080/ to http://10.222.18.113:8080.
> However, there is no change: I am still not able to crawl the
> password-protected pages.
>
> Kindly assist me to resolve this issue.
>
> Thanks in advance...
>
> Best regards,
> Biswajit.
>
>
>
>

Re: Not able to crawl password protected pages using NUTCH 0.9

Posted by biswajit_rout <bi...@lntinfotech.com>.
I have also attached hadoop.log for your reference...


biswajit_rout wrote:
> 
> Hi Susam,
> 
> 10.222.18.113 is simply the IP address of my own machine (localhost).
> I have now changed http://localhost:8080/ to http://10.222.18.113:8080.
> However, there is no change: I am still not able to crawl the
> password-protected pages.
> 
> Kindly assist me to resolve this issue.
> 
> Thanks in advance...
> 
> Best regards,
> Biswajit.
> 
> 
> 
> 
>  http://www.nabble.com/file/p19507146/hadoop.log hadoop.log 
> 

-- 
View this message in context: http://www.nabble.com/Not-able-to-crawl-password-protected-pages-using-NUTCH-0.9-tp19492246p19507198.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Not able to crawl password protected pages using NUTCH 0.9

Posted by biswajit_rout <bi...@lntinfotech.com>.
Hi Susam,

10.222.18.113 is simply the IP address of my own machine (localhost).
I have now changed http://localhost:8080/ to http://10.222.18.113:8080.
However, there is no change: I am still not able to crawl the
password-protected pages.

Kindly assist me to resolve this issue.

Thanks in advance...

Best regards,
Biswajit.




Susam Pal wrote:
> 
> The logs show that Nutch is fetching http://localhost:8080/, but you have
> set credentials for 10.222.18.113:8080, which is never fetched.
> So no authentication takes place.
> 
> Regards,
> Susam Pal
> 
http://www.nabble.com/file/p19507146/hadoop.log hadoop.log 
-- 
View this message in context: http://www.nabble.com/Not-able-to-crawl-password-protected-pages-using-NUTCH-0.9-tp19492246p19507146.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Not able to crawl password protected pages using NUTCH 0.9

Posted by Susam Pal <su...@gmail.com>.
The logs show that Nutch is fetching http://localhost:8080/, but you have
set credentials for 10.222.18.113:8080, which is never fetched.
So no authentication takes place.

Regards,
Susam Pal

On Mon, Sep 15, 2008 at 1:20 PM, biswajit_rout
<bi...@lntinfotech.com> wrote:
>
> Hi Susam,
>
> In order to crawl password-protected pages, I am using
> nutch-2008-07-10_04-01-48.tar (which I downloaded from
> http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ and which contains
> your patch for HttpAuthentication).
>
> I have modified nutch-site.xml and httpclient-auth.xml.
>
> Please find the attached zip file, which contains
> nutch-site.xml and httpclient-auth.xml.
>
> Kindly provide me a solution for this.
>
> Best regards,
> Biswajit
>
>
> Susam Pal wrote:
>>
>> Hi Biswajit,
>>
>> Could you please tell us how you have added the support for
>> authentication in Nutch 0.9? Nutch 0.9 can not do authentication
>> properly by default. The authentication feature is buggy in Nutch 0.9
>> which was fixed with this ticket:
>> https://issues.apache.org/jira/browse/NUTCH-559
>>
>> The feature is documented here:
>> http://wiki.apache.org/nutch/HttpAuthenticationSchemes
>>
>> The easiest way to use it is to check out the latest version of Nutch
>> and build it as it contains the authentication feature. If you want to
>> use it with Nutch 0.9, you have to download the latest patch present
>> in the ticket page and apply it to the source code and build it. You
>> might have to resolve some conflicts manually.
>>
>> I would suggest that you do not send the mail same mail multiple
>> times. We have received the same mail from you 4 times. It takes
>> sometime for members to reply to a mail. :-)
>>
>> Regards,
>> Susam Pal
>>
>> On Mon, Sep 15, 2008 at 6:07 PM, Rout Biswajit-B16078
>> <B1...@freescale.com> wrote:
>>> Hi,
>>>
>>> I have successfully configured NUTCH 0.9, which is crawling number of
>>> sites
>>> and after that searching is also happening properly.
>>>
>>> However, now I want to crawl password protected pages using NUTCH. In
>>> order
>>> to access those pages I should have a valid user name and password. I
>>> have
>>> configured the user name and password in my nutch-site.xml and
>>> httpclient-auth.xml
>>>
>>> However it is not crawling.
>>>
>>> I have attached nutch-site.xml, httpclient-auth.xml and hadoop.log in the
>>> Zip file for your reference. Kindly check and let me know what is missing
>>> from my end.
>>>
>>> CONFIGURATION:
>>>
>>> nutch-2008-07-10_04-01-48.tar (downloaded from
>>> http://hudson.zones.apache.org/hudson/job/Nutch-trunk/, which contains
>>> your patch for HttpAuthentication)
>>>
>>>
>>>
>>> Windows XP
>>>
>>> Cygwin
>>>
>>> jdk1.6.0
>>>
>>>
>>>
>>> Thanks in advance…
>>>
>>> Please help....
>>>
>>>
>>>
>>> Best regards,
>>>
>>> Biswajit
>>>
>>>
>>
>>
> http://www.nabble.com/file/p19492846/Nutch.zip Nutch.zip
> --
> View this message in context: http://www.nabble.com/Not-able-to-crawl-password-protected-pages-using-NUTCH-0.9-tp19492246p19492846.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>

Re: Not able to crawl password protected pages using NUTCH 0.9

Posted by biswajit_rout <bi...@lntinfotech.com>.
Hi Susam,

In order to crawl password-protected pages, I am using
nutch-2008-07-10_04-01-48.tar (downloaded from
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/, which contains your
patch for HttpAuthentication).

I have modified nutch-site.xml and httpclient-auth.xml.

Please find the attached zip file, which contains
nutch-site.xml and httpclient-auth.xml.

Kindly provide me a solution for this.

Best regards,
Biswajit


Susam Pal wrote:
> 
> Hi Biswajit,
> 
> Could you please tell us how you added support for
> authentication in Nutch 0.9? Nutch 0.9 cannot do authentication
> properly by default. The authentication feature is buggy in Nutch 0.9;
> this was fixed with this ticket:
> https://issues.apache.org/jira/browse/NUTCH-559
> 
> The feature is documented here:
> http://wiki.apache.org/nutch/HttpAuthenticationSchemes
> 
> The easiest way to use it is to check out the latest version of Nutch
> and build it, as it already contains the authentication feature. If you
> want to use it with Nutch 0.9, you have to download the latest patch
> from the ticket page, apply it to the source code, and build it. You
> might have to resolve some conflicts manually.
> 
> I would suggest that you do not send the same mail multiple
> times. We have received the same mail from you 4 times. It takes
> some time for members to reply to a mail. :-)
> 
> Regards,
> Susam Pal
> 
> On Mon, Sep 15, 2008 at 6:07 PM, Rout Biswajit-B16078
> <B1...@freescale.com> wrote:
>> Hi,
>>
>> I have successfully configured NUTCH 0.9, which is crawling a number of
>> sites, and after that searching is also working properly.
>>
>> However, now I want to crawl password-protected pages using NUTCH. To
>> access those pages I need a valid user name and password. I have
>> configured the user name and password in my nutch-site.xml and
>> httpclient-auth.xml.
>>
>> However, it is not crawling them.
>>
>> I have attached nutch-site.xml, httpclient-auth.xml and hadoop.log in the
>> Zip file for your reference. Kindly check and let me know what is missing
>> from my end.
>>
>> CONFIGURATION:
>>
>> nutch-2008-07-10_04-01-48.tar (downloaded from
>> http://hudson.zones.apache.org/hudson/job/Nutch-trunk/, which contains
>> your patch for HttpAuthentication)
>>
>>
>>
>> Windows XP
>>
>> Cygwin
>>
>> jdk1.6.0
>>
>>
>>
>> Thanks in advance…
>>
>> Please help....
>>
>>
>>
>> Best regards,
>>
>> Biswajit
>>
>>
> 
> 
http://www.nabble.com/file/p19492846/Nutch.zip Nutch.zip 


Re: Not able to crawl password protected pages using NUTCH 0.9

Posted by Susam Pal <su...@gmail.com>.
Hi Biswajit,

Could you please tell us how you added support for
authentication in Nutch 0.9? Nutch 0.9 cannot do authentication
properly by default. The authentication feature is buggy in Nutch 0.9;
this was fixed with this ticket:
https://issues.apache.org/jira/browse/NUTCH-559

The feature is documented here:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

The easiest way to use it is to check out the latest version of Nutch
and build it, as it already contains the authentication feature. If you
want to use it with Nutch 0.9, you have to download the latest patch
from the ticket page, apply it to the source code, and build it. You
might have to resolve some conflicts manually.

I would suggest that you do not send the same mail multiple
times. We have received the same mail from you 4 times. It takes
some time for members to reply to a mail. :-)

Regards,
Susam Pal

On Mon, Sep 15, 2008 at 6:07 PM, Rout Biswajit-B16078
<B1...@freescale.com> wrote:
> Hi,
>
> I have successfully configured NUTCH 0.9, which is crawling a number of sites,
> and after that searching is also working properly.
>
> However, now I want to crawl password-protected pages using NUTCH. To
> access those pages I need a valid user name and password. I have
> configured the user name and password in my nutch-site.xml and
> httpclient-auth.xml.
>
> However, it is not crawling them.
>
> I have attached nutch-site.xml, httpclient-auth.xml and hadoop.log in the
> Zip file for your reference. Kindly check and let me know what is missing
> from my end.
>
> CONFIGURATION:
>
> nutch-2008-07-10_04-01-48.tar (downloaded from
> http://hudson.zones.apache.org/hudson/job/Nutch-trunk/, which contains your
> patch for HttpAuthentication)
>
>
>
> Windows XP
>
> Cygwin
>
> jdk1.6.0
>
>
>
> Thanks in advance…
>
> Please help....
>
>
>
> Best regards,
>
> Biswajit
>
>