Posted to dev@nutch.apache.org by Jeff Maki <cr...@gmail.com> on 2007/09/06 22:04:18 UTC

Labeling URLs a-la Google

Hello everybody,

I'm working on a project that is essentially a searchable database for
academic citations at the University of Pittsburgh. One of our search
requirements was to break the search results into sections. To do
this, I implemented something similar to Google's "labels".

It's based heavily on the example plugin, and maybe not so pretty
code-wise, but it's a start.

Downloadable here:
http://upclose.lrdc.pitt.edu/people/maki_assets/nutch-regex-label.tar.gz

You configure it by adding something like the following to your nutch-site.xml file:

<property>
  <name>extension.regexlabeler.labels</name>
  <value>
    http://dev3\.informalscience\.org/research.*\.php.* = firsttag secondtag thirdtag,
    http://dev3\.informalscience\.org/project.*\.php.* = project,
    http://www.?\.informalscience\.org.* = oldsite,
    http://dev3\.informalscience\.org.* = devsite
  </value>
</property>

Notes:
* Format of each entry is <regular expression> = <labels, space delimited>; entries are separated by commas, as in the example above.
* URL patterns must be unique.
* Multiple labels for the same pattern are delimited by a space.
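
For anyone who wants to see how a value in this format could be interpreted, here is a minimal Java sketch. It is not the plugin's actual code. The class name RegexLabelSketch, the methods parse and labelsFor, and the test URL are all made up for illustration. It also makes several assumptions: a pattern must match the whole URL, a URL collects the labels of every pattern it matches, patterns contain no commas, and the last '=' in an entry separates the pattern from its labels.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch, not the plugin's real implementation.
public class RegexLabelSketch {

    // Parse "<regex> = <label label ...>, <regex> = ..." into an ordered map
    // from regex source to its labels.
    public static Map<String, List<String>> parse(String value) {
        Map<String, List<String>> rules = new LinkedHashMap<>();
        // Assumption: entries are comma-separated and patterns contain no commas.
        for (String entry : value.split(",")) {
            entry = entry.trim();
            if (entry.isEmpty()) {
                continue; // tolerate a trailing comma or blank entry
            }
            // Assumption: the last '=' splits pattern from labels, so a
            // pattern may itself contain '=' (e.g. in a query string).
            int eq = entry.lastIndexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Missing '=' in: " + entry);
            }
            String regex = entry.substring(0, eq).trim();
            String labelPart = entry.substring(eq + 1).trim();
            if (regex.isEmpty() || labelPart.isEmpty()) {
                throw new IllegalArgumentException("Incomplete entry: " + entry);
            }
            Pattern.compile(regex); // fail early on an invalid regular expression
            // Labels are delimited by whitespace (including a wrapped line).
            List<String> labels = Arrays.asList(labelPart.split("\\s+"));
            // Patterns must be unique, so reject a repeated pattern.
            if (rules.put(regex, labels) != null) {
                throw new IllegalArgumentException("Duplicate pattern: " + regex);
            }
        }
        return rules;
    }

    // Collect the labels of every pattern that matches the whole URL,
    // in configuration order and without duplicates.
    public static List<String> labelsFor(Map<String, List<String>> rules, String url) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> rule : rules.entrySet()) {
            // Recompiles the regex on every call; acceptable for a sketch.
            if (Pattern.matches(rule.getKey(), url)) {
                for (String label : rule.getValue()) {
                    if (!result.contains(label)) {
                        result.add(label);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String value =
              "http://dev3\\.informalscience\\.org/research.*\\.php.* = firsttag secondtag thirdtag,\n"
            + "http://dev3\\.informalscience\\.org/project.*\\.php.* = project,\n"
            + "http://www.?\\.informalscience\\.org.* = oldsite,\n"
            + "http://dev3\\.informalscience\\.org.* = devsite";
        Map<String, List<String>> rules = parse(value);
        // Hypothetical URL; under the assumptions above it matches both the
        // "project" pattern and the catch-all "devsite" pattern.
        System.out.println(labelsFor(rules, "http://dev3.informalscience.org/project_list.php"));
        // prints [project, devsite]
    }
}
```

Whether a URL should get the labels of every matching pattern or only those of the first match is a design choice. The sketch takes the union, so a broad pattern such as the "devsite" one can sit alongside more specific ones.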

Hope this saves somebody some time,

-Jeff

(BTW, Nutch has worked very well for us--excellent project!)