Posted to dev@nutch.apache.org by NIDHI MALIK <mm...@cse.iitb.ac.in> on 2007/12/28 12:28:04 UTC

nutch internet crawling help

Hello,
      I am facing a problem using Nutch to crawl data from the web. I have
configured nutch-site.xml and nutch-default.xml, but the message "HTTP 407
error authentication failure" is still displayed. I have also set
the http_proxies.

I have also tried wget. During local crawling, the following message is
displayed.

------------------------------
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: /home/nidhi/Nutch_Installation/nutch-0.8.1/logs/hadoop.log (Permission denied)
        at java.io.FileOutputStream.openAppend(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:177)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
        at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
        at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
        at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
        at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:132)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:96)
        at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:654)
        at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:612)
        at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:509)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:415)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:441)
        at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:468)
        at org.apache.log4j.LogManager.<clinit>(LogManager.java:122)
        at org.apache.log4j.Logger.getLogger(Logger.java:104)
        at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:229)
        at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:65)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:529)
        at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:235)
        at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:209)
        at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:351)
        at org.apache.nutch.crawl.Injector.<clinit>(Injector.java:40)
------------------------------


Can anyone please suggest a solution?


Thanks



Re: nutch internet crawling help

Posted by Susam Pal <su...@gmail.com>.
Hi,

These types of questions should go to
nutch-user@lucene.apache.org (the nutch-user mailing list), so I am
sending my reply to the nutch-user list with you in the CC field.

Regarding your question, you haven't provided the logs for the
authentication failure. You say that you get "HTTP 407 error
authentication failure", but the log you posted shows only a
permission-denied error for hadoop.log.

The first error (HTTP 407 means "Proxy Authentication Required") occurs
because you have not set the proxy authentication details. You can do so
in conf/nutch-site.xml by adding the following properties:

<property>
  <name>http.proxy.username</name>
  <value></value>
  <description>Username for proxy. This will be used by
  'protocol-httpclient', if the proxy server requests basic, digest
  and/or NTLM authentication. To use this, 'protocol-httpclient' must
  be present in the value of 'plugin.includes' property.
  NOTE: For NTLM authentication, do not prefix the username with the
  domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
  </description>
</property>

<property>
  <name>http.proxy.password</name>
  <value></value>
  <description>Password for proxy. This will be used by
  'protocol-httpclient', if the proxy server requests basic, digest
  and/or NTLM authentication. To use this, 'protocol-httpclient' must
  be present in the value of 'plugin.includes' property.
  </description>
</property>

<property>
  <name>http.proxy.realm</name>
  <value></value>
  <description>Authentication realm for proxy. Do not define a value
  if realm is not required or authentication should take place for any
  realm. NTLM does not use the notion of realms. Specify the domain name
  of NTLM authentication as the value for this property. To use this,
  'protocol-httpclient' must be present in the value of
  'plugin.includes' property.
  </description>
</property>

<property>
  <name>http.agent.host</name>
  <value></value>
  <description>Name or IP address of the host on which the Nutch crawler
  would be running. Currently this is used by 'protocol-httpclient'
  plugin.
  </description>
</property>

You have to use protocol-httpclient instead of protocol-http for proxy
authentication to happen. To do this, override the plugin.includes
property in conf/nutch-site.xml. For example:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|mp3|oo|msexcel|mspowerpoint|msword|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
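
To check the proxy credentials independently of Nutch, something like the
following might help. It is only a sketch: proxy_status is a hypothetical
helper (not part of Nutch), it assumes curl is installed, and the host,
port, username and password in the example call are placeholders.

```shell
#!/bin/sh
# Hypothetical helper, not part of Nutch: fetch a URL through an
# authenticating proxy with curl and print only the HTTP status code.
# Arguments: proxy host:port, "user:password", URL to fetch.
proxy_status() {
    curl --silent --output /dev/null --write-out '%{http_code}\n' \
         --proxy "http://$1" --proxy-user "$2" --proxy-anyauth "$3"
}

# Example call (all values are placeholders):
#   proxy_status proxy.example.com:3128 'susam:secret' http://lucene.apache.org/nutch/
# A result of 407 means the proxy rejected the credentials.
```

If this also prints 407, the problem is probably with the credentials or
the proxy itself rather than with the Nutch configuration.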

The second error probably occurs because the user running Nutch does not
have write permission on the log file, hadoop.log. Checking the
permissions and correcting them might fix it.
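
A minimal sketch of such a check, assuming a POSIX shell. check_log is a
hypothetical helper; the default path is the one from the stack trace in
the original message and may differ on other installations.

```shell
#!/bin/sh
# Hypothetical helper: report whether the current user can write a log file.
# Prints "writable", "not-writable" or "missing".
check_log() {
    log_file="$1"
    log_dir=$(dirname "$log_file")
    if [ -e "$log_file" ]; then
        if [ -w "$log_file" ]; then echo "writable"; else echo "not-writable"; fi
    elif [ -d "$log_dir" ] && [ -w "$log_dir" ]; then
        # The file does not exist yet, but it could be created in this directory.
        echo "writable"
    elif [ -d "$log_dir" ]; then
        echo "not-writable"
    else
        echo "missing"
    fi
}

# Path taken from the stack trace; pass a different path as the first argument.
LOG_FILE="${1:-/home/nidhi/Nutch_Installation/nutch-0.8.1/logs/hadoop.log}"
echo "$LOG_FILE: $(check_log "$LOG_FILE")"

# Show owner and mode of the directory and the file, if they exist.
ls -ld "$(dirname "$LOG_FILE")" "$LOG_FILE" 2>/dev/null || true
```

If the result is "not-writable", running chmod u+w on the file as its
owner, or having root chown the logs directory to the user who runs
Nutch, should let log4j open hadoop.log.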

Regards,
Susam Pal
