Posted to dev@nutch.apache.org by "Chris A. Mattmann (JIRA)" <ji...@apache.org> on 2015/07/16 03:00:07 UTC

[jira] [Resolved] (NUTCH-2058) Indexer plugin that allows RegEx replacements on the NutchDocument field values

     [ https://issues.apache.org/jira/browse/NUTCH-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved NUTCH-2058.
--------------------------------------

Committed! Thanks Peter!

{noformat}
[mattmann-0420740:~/tmp/nutch-trunk] mattmann% svn commit -m "Fix for NUTCH-2058: Indexer plugin that allows RegEx replacements on the NutchDocument field values contributed by PeterCiuffetti <pc...@astreetpress.com> this closes #44"
Sending        CHANGES.txt
Sending        build.xml
Sending        conf/nutch-default.xml
Sending        src/plugin/build.xml
Adding         src/plugin/index-replace
Adding         src/plugin/index-replace/README.txt
Adding         src/plugin/index-replace/build.xml
Adding         src/plugin/index-replace/ivy.xml
Adding         src/plugin/index-replace/plugin.xml
Adding         src/plugin/index-replace/sample
Adding         src/plugin/index-replace/sample/testIndexReplace.html
Adding         src/plugin/index-replace/src
Adding         src/plugin/index-replace/src/java
Adding         src/plugin/index-replace/src/java/org
Adding         src/plugin/index-replace/src/java/org/apache
Adding         src/plugin/index-replace/src/java/org/apache/nutch
Adding         src/plugin/index-replace/src/java/org/apache/nutch/indexer
Adding         src/plugin/index-replace/src/java/org/apache/nutch/indexer/replace
Adding         src/plugin/index-replace/src/java/org/apache/nutch/indexer/replace/FieldReplacer.java
Adding         src/plugin/index-replace/src/java/org/apache/nutch/indexer/replace/ReplaceIndexer.java
Adding         src/plugin/index-replace/src/java/org/apache/nutch/indexer/replace/package-info.java
Adding         src/plugin/index-replace/src/test
Adding         src/plugin/index-replace/src/test/org
Adding         src/plugin/index-replace/src/test/org/apache
Adding         src/plugin/index-replace/src/test/org/apache/nutch
Adding         src/plugin/index-replace/src/test/org/apache/nutch/indexer
Adding         src/plugin/index-replace/src/test/org/apache/nutch/indexer/replace
Adding         src/plugin/index-replace/src/test/org/apache/nutch/indexer/replace/TestIndexReplace.java
Adding         src/plugin/parse-replace
Adding         src/plugin/parse-replace/README.txt
Adding         src/plugin/parse-replace/build.xml
Adding         src/plugin/parse-replace/ivy.xml
Adding         src/plugin/parse-replace/plugin.xml
Adding         src/plugin/parse-replace/sample
Adding         src/plugin/parse-replace/sample/testParseReplace.html
Adding         src/plugin/parse-replace/src
Adding         src/plugin/parse-replace/src/java
Adding         src/plugin/parse-replace/src/java/org
Adding         src/plugin/parse-replace/src/java/org/apache
Adding         src/plugin/parse-replace/src/java/org/apache/nutch
Adding         src/plugin/parse-replace/src/java/org/apache/nutch/parse
Adding         src/plugin/parse-replace/src/java/org/apache/nutch/parse/replace
Adding         src/plugin/parse-replace/src/java/org/apache/nutch/parse/replace/ReplaceParser.java
Adding         src/plugin/parse-replace/src/java/org/apache/nutch/parse/replace/package-info.java
Adding         src/plugin/parse-replace/src/test
Adding         src/plugin/parse-replace/src/test/org
Adding         src/plugin/parse-replace/src/test/org/apache
Adding         src/plugin/parse-replace/src/test/org/apache/nutch
Adding         src/plugin/parse-replace/src/test/org/apache/nutch/parse
Adding         src/plugin/parse-replace/src/test/org/apache/nutch/parse/replace
Adding         src/plugin/parse-replace/src/test/org/apache/nutch/parse/replace/TestParseReplace.java
Transmitting file data .....................
Committed revision 1691298.
[mattmann-0420740:~/tmp/nutch-trunk] mattmann% 
{noformat}


> Indexer plugin that allows RegEx replacements on the NutchDocument field values
> -------------------------------------------------------------------------------
>
>                 Key: NUTCH-2058
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2058
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Peter Ciuffetti
>            Assignee: Chris A. Mattmann
>             Fix For: 1.11
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> This is the description of an IndexingFilter plugin I'm developing that allows regex replacements on field values before they are indexed to your search engine.
> *Plugin name*: index-replace
> *Property name*: index.replace.regexp
> *Use case example:*
> I'm indexing Nutch-created documents to a pre-existing SOLR core.  In this case I need to coerce the documents into the schema and field formats expected by the existing core.  The features of index-static and solrindex-mapping.xml get me most of the way.  Among other things, I need to generate identifiers from the web URLs.  So I need to do something like a regex replace on the id provided and then (with solrindex-mapping.xml) move this to the field name defined by the existing core.
> Another use case might be to refactor all URLs stored in the document so they route through a redirector gateway.
> The following is from the draft description in nutch-default.xml
> *Description:*
> Allows indexing-time regexp replace manipulation of metadata fields. The format of the property is a list of regexp replacements, one line per field being modified.  To use this property, add index-replace to your list of activated plugins.
>     
> *Example:*
> {code:xml}
> <property>
>   <name>index.replace.regexp</name>
>   <value>
>         fldname1=/regexp/replacement/flags
>         fldname2=/regexp/replacement/flags
>   </value>
> </property>
> {code}
> Field names would be one of those from https://wiki.apache.org/nutch/IndexStructure. The replacements will happen in the order listed. If a field needs multiple replacement operations, it may be listed more than once.
> The *field name* precedes the equal sign.  The first character after the equal sign signifies the delimiter for the regexp, the replacement value and the flags.
> The *regexp* and the optional *flags* should correspond to Pattern.compile(String regexp, int flags) defined here: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#compile%28java.lang.String,%20int%29
> The *flags* value is the integer sum of the flag values defined in http://docs.oracle.com/javase/7/docs/api/constant-values.html (Sec: java.util.regex.Pattern)
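To illustrate the flags arithmetic, here is a minimal sketch (not code from the plugin). The Pattern constants are the standard JDK ones; the class name FlagSumDemo and the sample pattern are made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the Pattern constants come from the JDK.
final class FlagSumDemo {
    // CASE_INSENSITIVE (2) + MULTILINE (8): the integer you would write as the flags value.
    static int caseInsensitiveMultiline() {
        return Pattern.CASE_INSENSITIVE + Pattern.MULTILINE;
    }

    public static void main(String[] args) {
        int flags = caseInsensitiveMultiline();
        Pattern p = Pattern.compile("^foo", flags);
        System.out.println(flags);                          // 10
        System.out.println(p.matcher("bar\nFOO").find());   // true
    }
}
```

Under this reading, a rule that should be case-insensitive and multiline would carry 10 as its flags value, for instance title=/^foo/bar/10 (an invented rule, shown only for the syntax).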
> Patterns are compiled when the plugin is initialized for efficiency.
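A rough sketch of how one rule line could be parsed and compiled once up front, based only on the format described above. ReplaceRule, parse and apply are hypothetical names, not the plugin's actual classes, and the sketch assumes the delimiter never appears inside the regexp or the replacement.

```java
import java.util.regex.Pattern;

// Hypothetical helper: holds one compiled "field=/regexp/replacement/flags" rule.
final class ReplaceRule {
    final String field;
    final Pattern pattern;
    final String replacement;

    private ReplaceRule(String field, Pattern pattern, String replacement) {
        this.field = field;
        this.pattern = pattern;
        this.replacement = replacement;
    }

    // Parses a single rule line; the pattern is compiled here, once.
    static ReplaceRule parse(String line) {
        String trimmed = line.trim();
        int eq = trimmed.indexOf('=');
        if (eq < 1 || eq + 1 >= trimmed.length()) {
            throw new IllegalArgumentException("Expected field=/regexp/replacement/flags: " + line);
        }
        String field = trimmed.substring(0, eq).trim();
        // The first character after '=' is the delimiter.
        char delim = trimmed.charAt(eq + 1);
        // Limit -1 keeps trailing empty parts (empty replacement, empty flags).
        String[] parts = trimmed.substring(eq + 2)
                .split(Pattern.quote(String.valueOf(delim)), -1);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Missing replacement in: " + line);
        }
        // Assumption: missing or empty flags mean 0 (no flags).
        int flags = (parts.length > 2 && !parts[2].isEmpty()) ? Integer.parseInt(parts[2]) : 0;
        return new ReplaceRule(field, Pattern.compile(parts[0], flags), parts[1]);
    }

    // Applies the precompiled pattern to one field value.
    String apply(String value) {
        return pattern.matcher(value).replaceAll(replacement);
    }

    public static void main(String[] args) {
        ReplaceRule rule = ReplaceRule.parse("id=/\\s+//");
        System.out.println(rule.field + " -> " + rule.apply("a b  c"));
    }
}
```

Because the delimiter is taken from the rule itself, a line such as title=#nutch#Nutch#2 parses the same way with '#' in place of '/'.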
> *Escaping*: since the regexp is read from a config file, any escaped values must be double escaped.  For example:
> {code}
>   id=/\\s+//
> {code}
> will cause the escaped \s+ match pattern to be used.
> The *replacement* value should correspond to Java Matcher(CharSequence input).replaceAll(String replacement):  http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#replaceAll%28java.lang.String%29
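Since the replacement follows Matcher.replaceAll semantics, capture groups can be referenced as $1, $2 and so on. The sketch below applies that to the identifier-from-URL use case; the pattern, the "host:path" id format and the class name UrlToIdDemo are assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo: derive an identifier from a URL using capture groups.
final class UrlToIdDemo {
    // Group 1 captures the host, group 2 the path after the first slash.
    private static final Pattern URL = Pattern.compile("^https?://([^/]+)/(.*)$");

    static String toId(String url) {
        // "$1:$2" is interpreted by Matcher.replaceAll as group references.
        return URL.matcher(url).replaceAll("$1:$2");
    }

    public static void main(String[] args) {
        // prints example.com:docs/a.html
        System.out.println(toId("http://example.com/docs/a.html"));
    }
}
```

A literal $ or backslash in the replacement has to be escaped; Matcher.quoteReplacement(String) does this when the replacement should be taken verbatim.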
>     
> *Multi-valued Fields*
> If a field has multiple values, the replacement will be applied to each value in turn.
> *Non-string Datatypes*
> Replacement is possible only on String field datatypes.  If the field you name in the property is not a String datatype, it will be silently ignored.
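The two rules above (every value of a multi-valued field is processed, non-String values are skipped) might look roughly like this. It is a sketch over a plain List<Object>, not the NutchDocument API, and checking each value individually is an assumption; FieldValueReplacer is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper operating on a plain list of field values.
final class FieldValueReplacer {
    static List<Object> replaceValues(List<Object> values, Pattern pattern, String replacement) {
        List<Object> out = new ArrayList<>(values.size());
        for (Object value : values) {
            if (value instanceof String) {
                // String values get the replacement, one value at a time.
                out.add(pattern.matcher((String) value).replaceAll(replacement));
            } else {
                // Non-String values are passed through unchanged, with no error.
                out.add(value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object> values = new ArrayList<>();
        values.add("a b");
        values.add(42);
        values.add("c  d");
        System.out.println(replaceValues(values, Pattern.compile("\\s+"), "_"));
    }
}
```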
> *Host and URL specific replacements*
> If the replacements should apply only to specific pages, then add a sequence like
> {code}
>     hostmatch=hostmatchpattern
>     fld1=/regexp/replace/flags
>     fld2=/regexp/replace/flags
> {code}
>     or
> {code}
>     urlmatch=urlmatchpattern
>     fld1=/regexp/replace/flags
>     fld2=/regexp/replace/flags
> {code}
> When using Host and URL replacements, all replacements preceding the first hostmatch or urlmatch will apply to all Nutch documents.  Replacements following a hostmatch or urlmatch will be applied to Nutch documents that match the host or url field (up to the next hostmatch or urlmatch line).  hostmatch and urlmatch patterns must be unique in this property.
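One way to picture the sectioning described above is to group the property's lines under the most recent hostmatch or urlmatch line, with everything before the first such line treated as global. This is only a sketch of that reading: ScopedRuleGrouper, the "*" key for the global section and the choice to reject duplicates with an exception are all assumptions, not the plugin's behavior.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: splits the property value into global and scoped sections.
final class ScopedRuleGrouper {
    static final String GLOBAL = "*"; // key for rules that apply to all documents

    static Map<String, List<String>> group(String propertyValue) {
        Map<String, List<String>> sections = new LinkedHashMap<>();
        String current = GLOBAL;
        sections.put(current, new ArrayList<String>());
        for (String raw : propertyValue.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("hostmatch=") || line.startsWith("urlmatch=")) {
                // Match patterns must be unique within the property.
                if (sections.containsKey(line)) {
                    throw new IllegalArgumentException("Duplicate match pattern: " + line);
                }
                current = line;
                sections.put(current, new ArrayList<String>());
            } else {
                // A replacement rule belongs to the section opened most recently.
                sections.get(current).add(line);
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        String value = "id=/a/b/\n"
                + "hostmatch=.*\\.example\\.com\n"
                + "title=/x/y/\n"
                + "urlmatch=.*/docs/.*\n"
                + "id=/c/d/\n";
        System.out.println(group(value));
    }
}
```

The LinkedHashMap keeps the sections in the order they were listed, which matters because replacements are applied in order.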



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)