Posted to user@nutch.apache.org by Casey McTaggart <ca...@gmail.com> on 2012/09/25 23:25:58 UTC

crawl SMB server using Nutch and Hadoop?

hi,

has anyone been able to successfully crawl an SMB server with Nutch running
in deploy mode?
my urls/seed.txt looks like this:
    smb:///servername//

my regex-urlfilter.txt is configured to accept everything.

I can run a local crawl that is able to connect to the windows share and
crawl it, until I get the following error:
    org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out
which I assume means I'm out of disk space, although I don't understand why,
since df -h shows I should have space left.
but in any case, it does crawl until it hits this error, so it successfully
connects to the SMB server and collects some files.

when running in deploy mode, however, I always get
12/09/25 15:03:27 WARN crawl.Generator: Generator: 0 records selected for fetching, exiting ...
12/09/25 15:03:27 INFO crawl.Crawl: Stopping at depth=0 - no more URLs to fetch.

so I guess it doesn't consider the SMB URL valid enough to even try
connecting? I think I installed the plugin correctly, since it does the
right thing in runtime/local.

can anyone help? I'm using Nutch 1.5.1 and Cloudera CDH4 with hadoop 1.0.1.
thanks!

here's my nutch-site.xml

<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|protocol-smb|protocol-file|urlnormalizer-(basic|pass|regex)|urlfilter-regex|parse-(xml|text|html|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|addhdfskey</value>
  </property>
  <property>
    <name>plugin.folders</name>
    <value>/projects/nutch/apache-nutch-1.5.1/build/plugins,/projects/nutch/apache-nutch-1.5.1/lib,classes/plugins</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>100</value>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>100</value>
  </property>
</configuration>
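
one thing that may be relevant to the DiskErrorException in the local crawl:
as far as I understand, Hadoop raises it when none of its local scratch
directories has enough room, and in Hadoop 1.x those default to a location
under hadoop.tmp.dir (itself /tmp/hadoop-${user.name} by default). so df -h on
the crawl directory might not be looking at the volume that is actually
filling up. a minimal sketch of an override that could go inside the
<configuration> element above, assuming /data/hadoop-tmp is a hypothetical
path on a volume with free space (this is a guess at the cause, not something
I have confirmed):

```xml
  <!-- Assumption: the error comes from Hadoop's local scratch space filling up. -->
  <!-- /data/hadoop-tmp is a hypothetical path; use any volume with enough room. -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop-tmp</value>
  </property>
```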