Posted to user@nutch.apache.org by Mitch Baker <Mi...@iga.in.gov> on 2016/03/09 22:38:53 UTC

Only fetch 127.0.0.1:8080/*

I have a small setup to index some files on a local box.

Solr 5
Nutch 1.11

I thought I had it configured not to try any URLs that are not local to
the system, but it still fetches them:

fetching http://www.cpsc.gov/Media/Documents/Regulations-Laws--Standards/Advisory-Opinions/Wheelchairs-145--/ (queue crawl delay=2000ms)
fetching http://www.cpsc.gov/PageFiles/121846/fuclearance.pdf (queue crawl delay=2000ms)
fetching http://www.cpsc.gov/Business--Manufacturing/Business-Education/Business-Guidance/Phthalates-Information/ (queue crawl delay=2000ms)
-activeThreads=150, spinWaiting=148, fetchQueues.totalSize=2091, fetchQueues.getQueueCount=1
fetching http://www.cpsc.gov/es/Research--Statistics/ (queue crawl delay=2000ms)
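
To rule out the filter rules themselves, my understanding is that the stock
URLFilterChecker tool can test URLs by hand: it reads URLs from stdin and
prints a leading '+' (accepted) or '-' (rejected) for each. The invocation
below is my best reading of the 1.x usage, so treat it as a sketch:

echo "http://www.cpsc.gov/es/Research--Statistics/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
echo "http://127.0.0.1:8080/cocoon" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined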

The regex-urlfilter.txt:

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
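# (Aside: e.g. this would reject http://host/a/x/a/y/a/z/ because the
#  segment "/a" occurs three times.)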

# skip specific PDF files in the volumes directory
-.*00(FRONT|INTRO)\.PDF.*

#skip
#-^(http|https)://www\.*$
#-^(http|https)://blogs\.*$
#-^(http|https)://store\.*$
#-^(http|https)://.*\.google.com/.*$
#-^(http|https)://nist.gov/.*$


# accept anything else
#+.
+^http://127.0.0.1:8080/cocoon
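
For comparison, this is the minimal allow-list I would expect to do the same
job, with the dots escaped and an explicit deny-all at the end (assuming this
file is the one Nutch actually loads from its conf directory):

# allow only the local cocoon instance
+^http://127\.0\.0\.1:8080/cocoon
# ignore everything else
-.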

I have searched and tried several things, including these settings in nutch-site.xml:

<configuration>
<property>
<name>http.agent.name</name>
<value>nutch-solr-integration</value>
</property>
<property>
<name>generate.max.per.host</name>
<value>1000</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
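<!-- Aside: urlfilter-regex is listed above, so the regex URL filter plugin
     should be loaded for this crawl. -->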
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts or domains
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  See 'db.ignore.external.links.mode'.
  </description>
</property>
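<!-- Aside: the description references a companion property. I am not certain
     it is available in 1.11, but if it is, something like this (value
     assumed) should pin the scope to the host rather than the domain:
<property>
  <name>db.ignore.external.links.mode</name>
  <value>byHost</value>
</property>
-->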
<property>
  <name>db.max.outlinks.per.page</name>
  <value>0</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
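<!-- Aside: read literally, the value 0 above means zero outlinks are
     processed per page. Per the description, a negative value is what
     yields "all outlinks", e.g.:
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>
-->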
<property>
 <name>fetcher.max.crawl.delay</name>
 <value>3</value>
 <description>
 If the Crawl-Delay in robots.txt is set to greater than this value (in
 seconds) then the fetcher will skip this page, generating an error report.
 If set to -1 the fetcher will never skip such pages and will wait the
 amount of time retrieved from robots.txt Crawl-Delay, however long that
 might be.
 </description>
</property>
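<!-- Aside: the fetch log above shows "queue crawl delay=2000ms", which is
     under this 3-second cutoff, so this setting would not be excluding the
     external URLs either way. -->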
<property>
  <name>fetcher.queue.mode</name>
  <value>byHost</value>
  <description>Determines how to put URLs into queues. Default value is 'byHost',
  also takes 'byDomain' or 'byIP'.
  </description>
</property>

<property>
  <name>fetcher.verbose</name>
  <value>false</value>
  <description>If true, fetcher will log more verbosely.</description>
</property>
</configuration>


I inherited this setup and am not that well versed in Nutch. After many hours of searching and trying what I found, I still have no luck. I can't get it to crawl just the local
system, http://127.0.0.1:8080/cocoon.
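
In case it is useful, here is a minimal way to re-seed and inspect the
crawldb from the command line (the crawl/ and urls/ paths are my local
layout, so treat them as assumptions):

echo "http://127.0.0.1:8080/cocoon" > urls/seed.txt
bin/nutch inject crawl/crawldb urls
bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -dump crawldb-dump

The -dump output should show whether the cpsc.gov URLs are already sitting
in the crawldb from an earlier run.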


Any help would be greatly appreciated.

-- 
Mitch Baker <mi...@iga.in.gov>
LSA