Posted to dev@nutch.apache.org by Jeff Maki <cr...@gmail.com> on 2007/09/06 22:04:18 UTC
Labeling URLs a-la Google
Hello everybody,
I'm working on a project that is essentially a searchable database for
academic citations at the University of Pittsburgh. One of our
searching requirements was to be able to break the search results into
sections--in order to do this, I implemented something similar to
Google's "labels".
It's based heavily on the example plugin, and maybe not so pretty
code-wise, but it's a start.
Downloadable here:
http://upclose.lrdc.pitt.edu/people/maki_assets/nutch-regex-label.tar.gz
You configure it by adding something like the following to your nutch-site.xml file:
<property>
<name>extension.regexlabeler.labels</name>
<value>
http://dev3\.informalscience\.org/research.*\.php.* = firsttag secondtag thirdtag,
http://dev3\.informalscience\.org/project.*\.php.* = project,
http://www.?\.informalscience\.org.* = oldsite,
http://dev3\.informalscience\.org.* = devsite
</value>
</property>
Notes:
* Format of each entry is <regular expression> = <labels, space delimited>, with entries separated by commas as in the example above.
* URL patterns must be unique.
* Multiple tags for the same pattern are delimited by a space.
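To make the format in the notes above concrete, here is a minimal sketch of how such a value could be parsed and applied. It is written in Python for brevity and is not the plugin's actual code. The function names are hypothetical, and several behaviours are assumptions on my part: that a pattern must match the whole URL, that every matching pattern contributes its labels (rather than only the first match), and that patterns never contain a comma.

```python
import re

def parse_label_rules(value):
    """Parse 'regex = label label, regex = label' into [(compiled_regex, [labels])].

    Assumption: entries are comma-separated, so a pattern containing a
    comma (e.g. 'a{1,2}') would not survive this simple split.
    """
    rules = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the LAST '=' so a pattern may itself contain '='.
        pattern, sep, labels = entry.rpartition("=")
        if not sep:
            raise ValueError(f"missing '=' in entry: {entry!r}")
        # str.split() with no argument also absorbs wrapped lines.
        rules.append((re.compile(pattern.strip()), labels.split()))
    return rules

def labels_for(url, rules):
    """Collect the labels of every rule whose pattern matches the whole URL."""
    found = []
    for regex, labels in rules:
        if regex.fullmatch(url):  # assumption: whole-URL match
            for label in labels:
                if label not in found:
                    found.append(label)
    return found

VALUE = r"""
http://dev3\.informalscience\.org/research.*\.php.* = firsttag
secondtag thirdtag,
http://dev3\.informalscience\.org/project.*\.php.* = project,
http://www.?\.informalscience\.org.* = oldsite,
http://dev3\.informalscience\.org.* = devsite
"""

rules = parse_label_rules(VALUE)
print(labels_for("http://dev3.informalscience.org/research/view.php?id=1", rules))
# prints ['firsttag', 'secondtag', 'thirdtag', 'devsite']
```

Note that under these assumptions a research page picks up the "devsite" label as well, because the last pattern matches every dev3 URL. If only the most specific pattern should apply, a first-match rule with the entries ordered from most to least specific would be the alternative.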
Hope this saves somebody some time,
-Jeff
(BTW, Nutch has worked very well for us--excellent project!)