Posted to user@nutch.apache.org by Diego Bonesso <di...@gmail.com> on 2013/10/14 22:09:56 UTC

Only in domain / authentication

Hello, I have two questions. I'm using Nutch 2.2. I put two URLs in
seed.txt. In conf/nutch-site.xml, I created the property
db.ignore.external.links with the value true. First question: will my job
stay only within the domains of those two URLs? Second question: the second
URL requires authentication; how can I configure this? The login URL is
something like http://www.domain.com/login. Thanks a lot.

Re: Only in domain / authentication

Posted by Diego Bonesso <di...@gmail.com>.
Hello, I configured seed.txt with the site http://example.com.br. This site
requires authentication at http://example.com.br/login. I created a rule
in httpclient-auth.xml as follows:

<auth-configuration>
<credentials username="user" password="1111">
<authscope host="186.xxx.161.xxx" port="80" realm="login"/>
</credentials>
</auth-configuration>

First, how can I check that Nutch actually used the authentication?
Second, how can I fetch the whole site?

Thanks!!!



On Tue, Oct 15, 2013 at 1:23 AM, Talat UYARER <ta...@agmlab.com> wrote:

> Hi Diego,
> First Question:
> The db.ignore.external.links property is the correct way to stay within
> the domain.
>
> Second Question:
> If you need authentication, you should use protocol-httpclient instead of
> protocol-http. Change plugin.includes accordingly and add this property
> to your nutch-site.xml:
>
> <property>
> <name>http.auth.file</name>
> <value>httpclient-auth.xml</value>
> <description></description>
> </property>
>
> httpclient-auth.xml is your authentication configuration file, where you
> add your credentials. The comments in that file contain some examples.
>
> Talat
>
>
> On 14-10-2013 23:09, Diego Bonesso wrote:
>
>> Hello, I have two questions. I'm using Nutch 2.2. I put two URLs in
>> seed.txt. In conf/nutch-site.xml, I created the property
>> db.ignore.external.links with the value true. First question: will my job
>> stay only within the domains of those two URLs? Second question: the
>> second URL requires authentication; how can I configure this? The login
>> URL is something like http://www.domain.com/login. Thanks a lot.
>>
>>
>

Re: Only in domain / authentication

Posted by Talat UYARER <ta...@agmlab.com>.
Hi Diego,
First Question:
The db.ignore.external.links property is the correct way to stay within the
domain.
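
As a minimal sketch, the entry would sit inside the <configuration> element
of conf/nutch-site.xml like this (the description text here is illustrative
wording, not copied from nutch-default.xml):

<property>
<name>db.ignore.external.links</name>
<value>true</value>
<description>Ignore outlinks that lead to external sites.</description>
</property>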

Second Question:
If you need authentication, you should use protocol-httpclient instead of
protocol-http. Change plugin.includes accordingly and add this property
to your nutch-site.xml:

<property>
<name>http.auth.file</name>
<value>httpclient-auth.xml</value>
<description></description>
</property>

httpclient-auth.xml is your authentication configuration file, where you
add your credentials. The comments in that file contain some examples.
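
A rough sketch of the plugin.includes change follows. Only the swap of
protocol-http for protocol-httpclient comes from the advice above; the other
plugin names in the value are an assumption, so keep whatever your current
setting already lists:

<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>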

Talat


On 14-10-2013 23:09, Diego Bonesso wrote:
> Hello, I have two questions. I'm using Nutch 2.2. I put two URLs in
> seed.txt. In conf/nutch-site.xml, I created the property
> db.ignore.external.links with the value true. First question: will my job
> stay only within the domains of those two URLs? Second question: the
> second URL requires authentication; how can I configure this? The login
> URL is something like http://www.domain.com/login. Thanks a lot.
>